Newer
Older
Introgression patterns between house mouse subspecies and species reveal genomic windows of frequent exchange
Ullrich KK, Linnenbrink M, Tautz D
Genome mapping files for _Mus musculus domesticus GER_, _Mus musculus domesticus FRA_, _Mus musculus domesticus IRA_, _Mus musculus musculus AFG_, _Mus musculus castaneus CAS_ and _Mus spretus SPRE_ were obtained from <http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/m_m_domesticus/genomes_bam/>, <http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/m_m_musculus/genomes_bam/>, <http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/m_m_castaneus/genomes_bam/>, <http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/m_spretus/genomes_bam/>.
For mapping details please look into the original publication ([Harr et al. 2016](http://www.nature.com/articles/sdata201675)) <http://www.nature.com/article-assets/npg/sdata/2016/sdata201675/extref/sdata201675-s7.docx>.
## Get masking regions for individual samples and natural populations
For masking genomic regions in natural populations which showed low coverage based on the genomic mapping BAM files we only considered the stable chromosomes (chr1-chr19,chrX,chrY,chrM) from the reference GRCm38 _mm10_ <http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/mouse/>.
The BAM files were processed with 'genomeCoverageBed' from the bedtools software suite ([Quinlan and Hall 2010](https://academic.oup.com/bioinformatics/article/26/6/841/244688/BEDTools-a-flexible-suite-of-utilities-for)) to obtain site specific genome coverage and further united with 'unionBedGraphs'.
##### Population specific masking
The per population combined coverage was further processed to only retain regions with a coverage smaller than 5 resulting as the masking regions.
genomeCoverageBed example for the population _Mus musculus musculus AFG_:
```
#example for populaton Mmm_AFG:
#AFG1_396.bam
#AFG2_413.bam
#AFG3_416.bam
#AFG4_424.bam
#AFG5_435.bam
#AFG6_444.bam
for file in *.bam; do genomeCoverageBed -ibam $file -bga -g $REFFERENCE > $file".bga";done
```
unionBedGraphs example for the population _Mus musculus musculus AFG_:
```
INPUT1=AFG1_396.bam.bga
INPUT2=AFG2_413.bam.bga
INPUT3=AFG3_416.bam.bga
INPUT4=AFG4_424.bam.bga
INPUT5=AFG5_435.bam.bga
INPUT6=AFG6_444.bam.bga
OUTPUT=Mmm_AFG.combined.bga
unionBedGraphs -i $INPUT1 $INPUT2 $INPUT3 $INPUT4 $INPUT5 $INPUT6 | awk -v OFS='\t' 'BEGIN {sum=0} {for (i=4; i<=NF; i++) sum+=$i; print $1,$2,$3,sum; sum=0}' > $OUTPUT
```
get masking region example for the population _Mus musculus musculus AFG_:
```
INPUT=Mmm_AFG.combined.bga
OUTPUT=Mmm_AFG.combined.bga.stcov5
awk '{if($4<5) print $0}' $INPUT > $INPUT".stcov5"
bedtools merge -i $INPUT".stcov5" > $INPUT".stcov5.merge"
awk -v OFS='\t' '{print $1,$2,$3,4}' $INPUT".stcov5.merge" > $OUTPUT
```
United masked files generated can be obtained from:
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/masking/
##### Individual specific masking
For individuals regions with a coverage smaller than 5 were used as masking regions.
get masking region example for individual 396 of the population _Mus musculus musculus AFG_:
```
INPUT=AFG1_396.bam.bga
OUTPUT=AFG1_396.bam.bga.stcov5
awk '{if($4<5) print $0}' $INPUT > $INPUT".stcov5"
bedtools merge -i $INPUT".stcov5" > $INPUT".stcov5.merge"
awk -v OFS='\t' '{print $1,$2,$3,4}' $INPUT".stcov5.merge" > $OUTPUT
```
+ bedtools v2.26.0 <http://bedtools.readthedocs.io/en/latest/>
For SNP and INDEL calling the BAM files were processed with 'samtools mpileup' and 'bcftools call' ([Li et al. 2011](https://academic.oup.com/bioinformatics/article/25/16/2078/204688/The-Sequence-Alignment-Map-format-and-SAMtools)) with relaxed quality options to retain information in CNV regions. Each chromosome was analyzed seperately.
samtools mpileup | bcftools call example for chromosome 1 for the population _Mus musculus musculus AFG_:
#AFG1_396.bam
#AFG2_413.bam
#AFG3_416.bam
#AFG4_424.bam
#AFG5_435.bam
#AFG6_444.bam
PLOIDY=ploidy.txt
#chrX 1 171031299 M 1
#chrY 1 91744698 M 1
#chrY 1 91744698 F 0
#chrM 1 16299 F 1
#chrM 1 16299 M 1
#* * * M 2
#* * * F 2
SAMPLES=AFG_samples.txt
#396 M
#413 M
#416 M
#424 M
#435 F
#444 M
OUTPUT=mpileup.q0Q10.chr1.Mmm_AFG.bcfcall.mv.vcf
INPUT1=AFG1_396.bam
INPUT2=AFG2_413.bam
INPUT3=AFG3_416.bam
INPUT4=AFG4_424.bam
INPUT5=AFG5_435.bam
INPUT6=AFG6_444.bam
echo $INPUT1 >> $BAMLIST
echo $INPUT2 >> $BAMLIST
echo $INPUT3 >> $BAMLIST
echo $INPUT4 >> $BAMLIST
echo $INPUT5 >> $BAMLIST
echo $INPUT6 >> $BAMLIST
samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -r chr1 -uf $REFERENCE -b $BAMLIST | bcftools call -O v -f GQ -m -v --ploidy-file $PLOIDY -S $SAMPLES > $OUTPUT
bgzip $OUTPUT
tabix $OUTPUT".gz"
+ samtools 1.3.1 (using htslib 1.3.1)
+ bcftools 1.3.1 (using htslib 1.3.1)
All mpileup generated for each population can be obtained from:
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/mpileup/
NOTE: For each population all analyzed chromosomes were merged into one file.
## K80 distance calculation
### Get population specific SNPs
To get population specific CONSENSUS VCF files the VCF file produced with 'bcftools call' was first re-coded into population specific VCF files with 'vcftools' ([Danecek et al. 2011](https://academic.oup.com/bioinformatics/article/27/15/2156/402296/The-variant-call-format-and-VCFtools)). Further the population specific VCF file containing multiple individuals was parsed with 'vcfparser.py mvcf2consensus' to obtain a CONSENSUS VCF file for each population. This CONSENSUS VCF files were used to generate pseudo-genomes files per natural population with 'vcfparser.py vcf2fasta' using also the masking regions (see "Get masking regions for individual samples and natural populations").
vcftools example for chromosome 1 for the population _Mus musculus musculus AFG_:
#396
#413
#416
#424
#435
#444
POPIDS=AFG.vcf.ids
echo "396" >> $POPIDS
echo "413" >> $POPIDS
echo "416" >> $POPIDS
echo "424" >> $POPIDS
echo "435" >> $POPIDS
echo "444" >> $POPIDS
GZVCF=mpileup.q0Q10.chr1.Mmm_AFG.bcfcall.mv.vcf.gz
OUTPUT=Mmm_AFG.mpileup.q0Q10.chr1.bcfcall.mv.remIndels
vcftools --gzvcf $GZVCF --remove-indels --recode --recode-INFO-all --non-ref-ac-any 1 --keep $POPIDS --out $OUTPUT
```
All re-coded vcf files can be obtained from:
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/recode/
NOTE: For each population all analyzed chromosomes were merged into one file.
vcfparser.py mvcf2consensus example for the population _Mus musculus musculus AFG_:
```
#example for population Mmm_AFG:
#396
#413
#416
#424
#435
#444
INPUT=Mmm_AFG.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.vcf
OUTPUT=Mmm_AFG.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.consensus
python vcfparser.py mvcf2consensus -ivcf $INPUT -o $OUTPUT -cdp 11 -chr chr1 -samples 396,413,416,424,435,444 -id Mmm_AFG.mv
All CONSENSUS vcf files can be obtained from:
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/consensus/
NOTE: For each population all analyzed chromosomes were merged into one file.
vcfparser.py vcf2fasta example for the population _Mus musculus musculus AFG_:
```
#example for population Mmm_AFG:
REFERENCE=mm10.fasta
INPUT=Mmm_AFG.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.consensus.vcf
OUTPUT=Mmm_AFG.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.consensus.chr1
python vcfparser.py vcf2fasta -ivcf $INPUT -o $OUTPUT -R $REFERENCE -samples Mmm_AFG.mv -chr chr1 -ibga $MASKFILE -cov2N 4
All pseudo-genome FASTA files can be obtained from:
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/fasta/consensus/
+ vcftools v0.1.15
+ vcfparser.py <https://gitlab.gwdg.de/evolgen/introgression/blob/master/scripts/vcfparser.py>
### Calculate dK80 distance between populations using the CONSENSUS pseudo-genome files
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
To calculate dK80 distance between populations, quartets ([X],[Y],[Z],[O]) are used. Within the population quartets, dK80 is calculated for trios (e.g. [X],[Z],[O]) on non-verlapping sequence windows ($w$) throughout the investigated genome between this population triplet on a window ($w_{i}$) as:
\begin{equation}
dK80_{XZO_{w_{i}}} = d_{XO_{w_{i}}} - d_{XZ_{w_{i}}}
\end{equation}
where $d_{XO_{w_{i}}} and $d_{XZ_{w_{i}}} are defined as the average Kimura's 2-parameter sequence distance ([Kimura 1980](https://www.ncbi.nlm.nih.gov/pubmed/7463489)) between the corresponding two populations calculated with the function 'dist.dna' of the of the R package 'ape' ([Paradis et al. 2004](https://academic.oup.com/bioinformatics/article/20/2/289/204981/APE-Analyses-of-Phylogenetics-and-Evolution-in-R)) using the model 'K80'. Prior the calculation of dK80 all sites with missing data within the specified window ($w_{i}$) and the specified populations were removed across the whole quartet with the 'Biostrings' R package ([Pages et al. 2009](https://bioconductor.org/packages/release/bioc/html/Biostrings.html)).
example for chromosome 1 for the dK80 calculation for the quartet [X]: _Mus musculus domesticus FRA; [Y]: _Mus musculus domesticus GER; [Z]: _Mus musculus domesticus IRA; [O]: _Mus musculus musculus AFG_:
```
#example for the quartet [X]: FRA; [Y]: GER; [Z]: IRA; [O]: AFG
#change the bottom part of the script 'get_dK80.r' for each chromosome and quartet
#here you can find the example for chromosome 1
popX <- "Mmd_FRA"
popY <- "Mmd_GER"
popZ <- "Mmd_IRA"
popO <- "Mmm_AFG"
TMP_DIR <- "/tmp"
popX.pos <- 2
popY.pos <- 3
popZ.pos <- 4
popO.pos <- 5
SEQ_FILE <- "http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/fasta/consensus/CAS_FRA_GER_IRA_AFG_SPRE.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.refmajorsample.ref.consensus.fasta"
chr <- "chr1"
OUT_FILE <- paste0(TMP_DIR,"/",popX,"_",popY,".",popZ,".",popO,".",chr,".tsv")
WSIZE <- 25000
WJUMP <- 25000
DISTMODEL <- "K80"
```
All dK80 files can be obtained from:
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/browser_tracks/dK80/25kbp_sw/
NOTE: For each quartet comparison all analyzed chromosomes were merged into one file.
_used software:_
+ R version 3.4.1 (2017-06-30)
+ R package ape_4.1
+ R package Biostrings_2.40.2
+ R package S4Vectors_0.10.3
+ R package XVector_0.12.1
+ R package IRanges_2.6.1
+ R package BiocGenerics_0.18.0
To simulate genomes, first we estimated the number of pair-wise segregating sites with the CONSENSUS pseudo-genome files adding the reference mm10. Subsequently we used