[ad_1]
Data collection and whole-genome resequencing
Collections and library preparation
Adult butterflies were collected between 2009 and 2018 and stored at −20 °C in either salt-saturated dimethyl sulfoxide or 100% ethanol. RNA-free genomic DNA was extracted from the thorax of butterflies using Qiagen Blood and Tissue and E.Z.N.A Tissue DNA kits (Omega Bio-tek), and used to prepare 350-bp insert Illumina libraries for 33 individuals, which were sequenced using 100–150-bp paired-end sequencing on Illumina instruments. Collecting and export permit numbers are provided in the Acknowledgements. We complemented these samples with previously published sequences (see Supplementary Table 1 for sample details).
Read filtering, mapping and genotype calling
After demultiplexing, reads were filtered for Illumina adapters using cutadapt v.1.8.1 (ref. 52) and then mapped to the H. melpomene assembly v.2.5 (Hmel2.5, ref. 53)(ref. 54) using BWA-MEM v.0.7.15 (ref. 55) with default parameters and marking short split hits as secondary. Mapped reads were sorted and duplicate reads removed using sambamba v.0.6.8 (ref. 56) sort and markdup modules, respectively. Mapped reads were further realigned around indels using the Genome Analysis Toolkit (GATK) v.3.8 RealignerTargetCreator and IndelRealigner modules57,58, to reduce the number of indel miscalls. Read depth and other relevant read alignment quality control metrics were computed using QualiMap v.2.2.1 (ref. 59).
Genotype calling was performed using the bcftools v.1.5 (ref. 60) mpileup and call modules, requiring a minimum MQ (mapping quality) and QUAL (base quality) of 20. Genotyping was performed jointly for individuals belonging to the same population using the multiallelic and rare-variant calling option (-m) in bcftools call. Ploidy aware genotype calling was performed for the Z chromosome. Genotypes were filtered using the bcftools filter module. Both invariant and variant sites were required to have QUAL (quality of the variant call) ≥ 20 and MQ (root mean square mapping quality) ≥ 20, with DP (read depth) ≥ 8 for individual genotypes (DP ≥ 4 for females on the Z chromosome) and GQ (genotype quality) ≥ 20. All genotypes not fulfilling these criteria or within 5 bp of an indel (–SnpGap) were recoded as missing data.
Species relationships and demographic modelling of hybrid speciation
Relationships between H. elevatus, H. pardalinus, H. melpomene and other closely related species were investigated by building a phylogenetic network. The dataset was filtered to include only biallelic sites (excluding singletons) without missing data and at least 1 kb apart, using Plink v.1.9 (ref. 61). Pairwise absolute genetic distances between all pairs of samples were calculated using the disMat.py script (https://github.com/simonhmartin/genomics_general). The distance matrix was then used to construct a phylogenetic network using the NeighbourNet approach62, implemented in SplitsTree v.4.15.1 (ref. 63), with default parameters.
We also investigated species relationships by estimating a concatenated neighbour-joining tree. In this analysis, we included both variable and invariable sites, at least 1 kb apart and without missing data. The neighbour-joining tree was estimated from individuals’ pairwise distances using the R package ape v.5.7 (ref. 64) ‘read.dna’ and ‘nj’ functions. Trees were rooted using the ‘midpoint’ function from the R package phangorn v.2.11.1 (ref. 65). Bootstrap supports were obtained on the basis of 100 bootstrap replicates, using the ‘boot.phylo’ function in the R package ape v.5.7 (ref. 64).
Genealogical relationships along the genome between the three focal species (H. elevatus, H. pardalinus and H. melpomene) were further investigated using TWISST66 (https://github.com/simonhmartin/twisst), and using Heliconius nattereri as an outgroup species. Only SNPs fixed in the outgroup (H. nattereri), variable in the focal species and with a minimum allele frequency (MAF) of 0.05 were considered. Statistical phasing and imputation were performed using Beagle v.5.1 (ref. 67), with default settings. The phased filtered dataset was used to infer neighbour-joining phylogenies for non-overlapping windows of 1,000 SNPs (median size of around 23.6 kb), assuming the GTR substitution model, in PHYML (ref. 68). Exact weightings were computed for all phylogenies. Windows were classified into each of the following categories when weighting support was 0.5 or greater: (i) H. elevatus and H. pardalinus group together but are not reciprocally monophyletic; (ii) H. elevatus and H. pardalinus group together and are reciprocally monophyletic; (iii) H. elevatus and H. melpomene group together but are not reciprocally monophyletic; and (iv) H. elevatus and H. melpomene group together and are reciprocally monophyletic.
To infer the timing of introgression from H. melpomene into H. elevatus and its split from H. pardalinus, we used the multispecies coalescent-with-introgression (MSCi) model implemented in BPP v.4.6.2 (ref. 22) (A00 analysis). For each species of the three species, we selected four individuals to generate sequenced alignments. For H. melpomene, we used H. melpomene aglaope from Peru. Given the population structure between Amazonian and non-Amazonian population of both H. elevatus and H. pardalinus and evidence for gene flow between the two species in the Amazon, we first performed this analysis using the non-Amazonian populations (that is, H. elevatus bari and H. pardalinus sergestus). Loci were selected randomly from autosomes, requiring loci to be 2 kb long, a minimum distance of 20 kb to the next closest locus and 5 kb from the closest exon as annotated in H. melpomene assembly v.2.5. For each locus, individuals with more than 20% missing data and sites containing missing genotype calls were removed. Only loci containing all individuals and 800 bp passing filters were considered. Heterozygous sites were assigned IUPAC ambiguity codes. Demographic parameter estimation was performed using a fixed species tree, with introgression events (see Fig. 1e and Extended Data Fig. 1). An inverse gamma prior (invG) was applied both to the root age (τ0) and to all populations’ effective population sizes (θ) – invG(a = 3, b = 0.06) and invG(a = 3, b = 0.04), respectively. A beta prior was applied to the introgression probability (j) – Beta(a = 1, b = 1). The MCMC was run for 1,000,000 iterations after 50,000 iterations of burn-in, sampling every 10 iterations.
Historic and recent gene flow
Species-diagnostic SNPs
To characterize instances of recent gene flow between Amazonian H. pardalinus and H. elevatus, we relied on ancestry-informative SNPs (allele frequency difference ≥ 0.8) between these two groups. Only ancestry-informative SNPs at least 10 kb apart were considered. For each SNP, an ancestry score of 0 and 1 was assigned for H. elevatus homozygous and H. pardalinus homozygous variants, respectively, and 0.5 for heterozygous. We then calculate each individual’s ancestry (average ancestry across SNPs) and heterozygosity, on the basis of the ancestry-informative SNPs passing the filters. A custom R script was used to visualize genotypes of species-diagnostic SNPs across the genomes of different individuals. The same approach was used to determine species-diagnostic SNPs between Amazonian H. elevatus and H. melpomene.
f
4 statistics
We calculated the f4 statistics in ADMIXTOOLS (ref. 69) to measure shared drift between pairs of populations of different species in the same location versus between pairs of populations of the same species in different locations. Shared drift between populations of different species in the same location is indicative of gene flow between species, and shared drift between populations of the same species in different locations is indicative of grouping by species. Only autosomal biallelic SNPs were considered in this analysis. Standard errors were estimated through a weighted block jackknife approach over 500-kb blocks. We also measured the Euclidean geographic distance between all possible pairs of locations and performed a Mantel test for its correlation with the f4 statistics.
Estimates of gene flow between population pairs
We used G-PhoCS (ref. 70) to estimate divergence times, effective population sizes and migration rates between pairs of populations of H. elevatus and H. pardalinus, both within and between species. In all analyses, we also include one individual from an outgroup species (Heliconius besckei) and estimate model parameters assuming possible bidirectional migration between the two ingroup species. G-PhoCS uses multiple independent neutrally evolving loci to infer demographic parameters. Therefore, we first defined regions of the genome within scaffolds larger than 1 Mb and at least 1 kb away from exons as annotated in H. melpomene assembly v.2.5. Within these regions we then selected 1-kb blocks that were at least 10 kb apart from the nearest block and produced sequence alignments, masking annotated repetitive elements and CpG islands identified with the software gCluster (ref. 71). Because previous studies have reported extensive introgression between both H. elevatus and H. pardalinus with other Heliconius species in large regions of the genome surrounding the three major colour pattern loci, we excluded blocks in chromosomes containing these loci (chromosomes 10, 15 and 18). We also excluded blocks in the Z chromosome owing to its different effective population size. For each alignment, we excluded individuals with more than 60% missing genotype calls, and only alignments with at least three individuals per population (or all individuals in the populations for those with fewer than three individuals) and a minimum of 100 bp for which no more than 25% of individuals had missing genotype calls were considered. We coded heterozygous genotype calls using IUPAC codes. A gamma prior with α = 2 and β = 100 was used for both the mutational-scaled effective population size (θ) and the divergence time (τ) between the two ingroup populations, whereas a gamma prior with α = 2 and β = 50 was used for the divergence time to the outgroup. For the mutation-scaled migration rates, we defined a gamma prior with α = 0.005 and β = 0.00001. The model was run three times, with a burn-in of 50,000 iterations (allowing for automatic fine-tuning of the parameters) followed by 200,000 iterations, sampling every 200 iterations. Convergence of the Markov chain and between the three different replicates was inspected using custom scripts. To convert the θ and τ estimates to absolute effective population size and divergence time, we assumed an average mutation rate (µ) of 2.9 × 10−9 substitutions per site per generation and an average generation time (g) of 0.25 years (ref. 72). We also obtain estimates of the effective migration rate (Nem) using the formula: NemAB = MAB × θB/4.
Simulations to infer robustness of G-PhoCS inferences
Whenever Nm > 1, estimates of Nm for the same population comparisons varied both in value and directionality between different replicate runs of G-PhoCS. To investigate the cause for these differences, we performed coalescent simulations using MSMS (ref. 73). We considered the same demographic scenario as for the G-PhoCS runs; that is, two sister populations (A and B) that diverged at TD1 and split from the outgroup (C) at TD2, and allowing either unidirectional or bidirectional migration between A and B. The split time between the two sister populations (TD1) was set to four million generations, and eight million generations for the split of the outgroup (TD2). An effective population size (Ne) of one million or five million was assumed for the two ingroup populations (400,000 for the outgroup), and varying levels of gene flow (Nm) were considered (0.01, 0.1, 1.0, 2.0 and 10.0). For each scenario, we simulated 100 trees in MSMS (ref. 73), from which we generated sequence alignments using Seq-Gen v.1.3.4 (ref. 74). Custom scripts were used to combine pairs of haploid sequences into diploid sequences, using IUPAC codes for heterozygous sites, and to convert the alignments to the G-PhoCS sequence format. Finally, we ran G-PhoCS for the simulated datasets using the same settings as described above. Whenever Nm > 1 in the simulated datasets, G-PhoCS showed a similar behaviour to what was seen in our analysis of the Heliconius data (Supplementary Table 6). We believe that this effect is due to the difficulty of estimating gene flow when the populations are nearly panmictic. Hence, for each population pairwise comparison, the highest Nm estimate among the three replicate runs is presented in Fig. 2b.
Species-tree inference
Phylogenetic relationships between the H. pardalinus and H. elevatus major groups were inferred using the multispecies coalescent (MSC) approach implemented in BPP v.4.6.2 (ref. 22), while accounting for incomplete lineage sorting. Three H. p. sergestus individuals (with the highest coverage) and three H. elevatus individuals from the Guianas (the individual with the highest coverage per location (French Guiana, Suriname and Venezuela)) were considered. For Amazonian H. pardalinus and H. elevatus, again, only the individual with the highest coverage from each of three locations—Ecuador, Bolivia and Brazil—was included. For this analysis, loci were selected by first defining regions of the genome within scaffolds larger than 1 Mb. To minimize the effect of linked selection, these regions also had to be at least 2 kb from exons as annotated in Heliconius melpomene v.2.5 (Hmel2.5, ref. 54). Because the analysis assumes no intra-locus recombination and independence between loci, we selected loci of 100–250 bp and at least 2 kb from neighbouring loci. Sequence alignments were produced for all loci, masking repetitive elements as annotated in the reference genome and CpG islands identified with the software gCluster (ref. 75). For each locus, individuals with more than 50% missing genotype calls were excluded from the alignment and only loci with at least two individuals per population were considered. Furthermore, sites with more than 20% of individuals with missing genotype calls were removed and loci with less than 50 bp passing filters were excluded. Loci were grouped into blocks of 100 loci, and those overlapping the inversion on chromosome 15 were grouped in a separate block. Species-tree estimation was then performed in BPP v.4.6.2 using the A01 analysis (species-tree inference assuming no gene flow). Inverse gamma priors (invGs) were applied both to the root age (τ0) and to effective population sizes (θ) – invG(3, 0.06) and invG(3, 0.04), respectively. Parameters were scaled assuming a mutation rate of 2.9 × 10−9 substitutions per site per generation and a generation time of 0.25 years (ref. 54). The MCMC was run for 1,000,000 iterations after 50,000 iterations of burn-in, sampling every 10 iterations. Three independent runs were performed for each block, using different starting species trees, and only blocks showing consistency among the three independent runs were considered. The most abundant estimated tree across the genome showed both species to be paraphyletic with respect to each other (Extended Data Fig. 2). We believe that this non-taxonomic arrangement is due to gene flow, which is not accounted for in the model.
Demographic modelling by analysis of site-frequency spectra
To understand the prevalence of gene flow at different stages of the speciation history of H. elevatus and H. pardalinus, we performed demographic modelling based on analysis of the site-frequency spectrum (SFS) using fastsimcoal2 v.2.7.0.2 (ref. 76). For this analysis, we considered all Amazonian and non-Amazonian populations of H. elevatus and H. pardalinus. Individuals with more than 50% missing data were excluded from the analysis and only sites genotyped in at least 80% of the individuals (including all four H. p. sergestus) were considered. Furthermore, only sites at least 2 kb apart and at least 10 kb from exons were considered, to mitigate the effects of linkage disequilibrium and linked selection, respectively. We further excluded sites within repetitive regions as annotated in the H. melpomene assembly Hmel2.5. The 209,115 sites that were retained after filtering were polarized by assessing the allele present in three outgroup species—H. besckei, Heliconius ismenius telchinia and Heliconius numata robigus. From each of the outgroup species, we chose one individual with the highest coverage and assigned the ancestral allele to each site if it was genotyped and monomorphic in the outgroup species. The unfolded multidimensional site-frequency spectrum (multiSFS) was generated using easySFS (https://github.com/isaacovercast/easySFS), using the recommended down projection approach (four individuals of H. p. sergestus; 10 northeastern group H. elevatus; and 20 H. pardalinus and 20 H. elevatus individuals from the Amazon) to maximize the number of segregating sites while accounting for missing data. For each demographic model, fitting of the simulated multidimensional site-frequency spectra to the empirical data was maximized using the composite-likelihood method implemented in fastsimcoal v.2.7 (ref. 77). For all model parameters, we used wide search ranges from which initial starting parameter values were randomly sampled. For each model, we performed 100 independent fastsimcoal2 runs. Parameter estimates optimization was performed for 40 expectation-maximization cycles and the expected SFS was estimated using 100,000 coalescent simulations. The best fitting model was identified by means of the Akaike information criterion, considering for each model the optimization run with the highest likelihood (using the script https://github.com/speciationgenomics/scripts/blob/master/calculateAIC.sh). To account for stochasticity in the likelihood approximation, we further compared likelihood distributions of the different models by performing 100 independent runs from parameter values estimated under the most likely replicate run for each model. Finally, for the best fit model, confidence intervals around the maximum likelihood parameter estimates were obtained by nonparametric block-bootstrapping. For this, the 209,115 sites were divided into 100 blocks and sampled with replacement.
Genomic islands of divergence and introgression
Summary statistics
We calculated between-population differentiation (FST) for Amazonian and non-Amazonian populations of both H. elevatus and H. pardalinus groups, in sliding windows of 25 kb (5 kb step size) along the genome using the script popgenWindows.py (https://github.com/simonhmartin/genomics_general). The script implements a version of Hudson’s KST (ref. 78), modified to avoid weighting nucleotide diversity in each population by sample size. Individuals with more than 50% missing data were removed. Only sites with a maximum of two alleles, and with at least three individuals with genotype calls per population (or the total number of individuals in populations with fewer than three individuals) were considered. Only windows with at least 10% of sites passing filters were considered in the analysis.
Topology weighting
To determine genomic regions in which H. elevatus and H. pardalinus are reciprocally monophyletic (that is, genomic regions that are potentially involved in species barriers), genealogical relationships between Amazonian and non-Amazonian populations along the genome were quantified using TWISST66 (https://github.com/simonhmartin/twisst). The same dataset as for FST was used, but also adding five individuals of representative outgroup species (H. besckei, H. ismenius, H. numata, H. nattereri and H. ethilla). Statistical phasing and imputation were performed using Beagle 5.1 (ref. 67), with default settings. Only SNPs fixed in all outgroup individuals and variable in the ingroup population with an MAF of 0.05 were considered. The phased filtered dataset was used to infer neighbour-joining phylogenies for windows of 100 SNPs (slide every 25 SNPs), assuming the GTR substitution model, in PHYML (ref. 68). Exact weightings were computed for all phylogenies. To estimate the proportion of trees supporting a grouping of individuals by species versus grouping by geography, we considered five groups: (i) H. elevatus from the Guianas (Venezuela, Suriname and French Guiana); (ii) H. elevatus from the Amazon; (iii) H. pardalinus from the Amazon; (iv) H. p. sergestus (Andes); and (v) an outgroup, H. nattereri. Because we hypothesize that introgression from H. melpomene into H. elevatus could be involved in speciation of the latter and H. pardalinus, the same analysis was performed including only Amazonian H. elevatus, Amazonian H. pardalinus and two H. melpomene populations (H. m. amaryllis and H. m. aglaope). By including H. ethilla (a sister species to H. elevatus and H. pardalinus) as a fifth population, we were able to polarize the genealogies, allowing determination of the direction of introgression.
Association between H. melpomene introgression and genomic islands of divergence
To test whether H. melpomene introgression in the genome of H. elevatus is associated with genomic islands of divergence between sympatric H. elevatus and H. pardalinus, we performed a Fisher’s exact test. First, we defined genomic islands of divergence as regions with FST ≥ 0.2 and in which TWISST recovered both H. pardalinus and H. elevatus as reciprocally monophyletic (with weight ≥ 0.8). Second, we defined as introgressed, genomic regions in which TWISST grouped H. elevatus with H. melpomene with a weight ≥ 0.8. We then performed a Fisher’s exact test, as implemented in bedtools v.2.30.0 (ref. 79), to test whether the two sets of genomic intervals overlap more than expected given the size of the reference genome.
Genetic mapping of traits involved in reproductive isolation
Captive populations of Amazonian H. elevatus pseudocupidineus, H. pardalinus butleri and H. m. agalope were established in outdoor insectaries in Tarapoto, Peru and in heated indoor insectaries in York, UK, as previously described17. Crosses for QTL mapping were generated by mating H. elevatus with H. pardalinus to produce F1 broods, and then by either crossing these amongst themselves to generate F2 broods or backcrossing to parental taxa.
Colour pattern phenotyping
Dorsal surfaces of wings from 12 H. elevatus, 19 H. pardalinus, 14 H. m. aglaope, 348 F2 and 50 backcross hybrids were photographed in a light box against a white background using a Canon EOS D1000 together with an X-rite ColorChecker Mini (Supplementary Table 7). From each image, we selected a single forewing and hindwing for analysis, clipped the image to the wing outlines and flipped wings when necessary to ensure that all were similarly orientated (resulting in two files; one forewing and one hindwing). To align the wings so that pixels represent homologous units among individuals, we used image registration80, a regression-based method that aligns two sets of wings (a source and a reference) according to intensity-based similarity. We chose the reference set of wings using the PCA of wing shape (see below). For forewing (36 PCs) and hindwing (26 PCs) we found the mean value for each PC across all F2 and backcross individuals. We assigned the reference individual as the individual that had the minimum deviation from these mean values (summed across all PCs). We then checked all alignments by eye. To allow for minor misalignment or damage to wings, we included pixels in which up to 5% of individuals had missing RGB values.
Wing shape
Wing shape was quantified in 31 H. elevatus, 26 H. pardalinus, 10 H. m. aglaope and 308 F2 and 36 backcross hybrids using landmark-based geometric morphometrics analyses (Supplementary Table 7). The ventral side of the butterfly wings was scanned using a flatbed scanner at 300 dpi and landmarks were placed at specific vein intersections81 on the forewing (20 landmarks) and hindwing (15 landmarks) using tpsDig282. Landmark coordinates were adjusted for size and orientation using a Procrustes analysis from the package geomorph83. Forewings and hindwings were analysed separately.
Flight
H. elevatus (n = 12), H. pardalinus (n = 13), H. m. aglaope (n = 5) and F2s (n = 40) were filmed flying freely in a large flight cage (5 × 2.5 × 2 m) using a GoPro HERO 4 Black camera at 239.7 frames per second at a resolution of 720p (Supplementary Table 7). Videos were studied in slow motion using GoPro Studio v.2.5.9.2658. Flight sequences in which an individual was flying straight and level for at least five wing beats were selected to measure wing beat frequency (WBF). WBF was measured by counting the number of complete wing beats and the number of video frames. Five WBF measurements were taken per individual from separate flight sequences and used to calculate the individuals’ mean WBF by dividing the total number of wing beats across all flight sequences by the total flight time estimated from the number of video frames.
Female host plant preference
Host plant preference assays for QTL mapping were performed by introducing single H. elevatus, H. pardalinus and F2 females (n = 24, 32 and 31, respectively) into cages measuring 1 m (W) × 2 m (L) × 1.7 m (H), with two approximately equally sized shoots of the host plants (P. riparia and P. venusta) placed in the back corners. At the end of each day, the number of eggs laid on each plant species was recorded and the eggs were removed (Supplementary Table 7). To compare the oviposition preference of Peruvian H. elevatus, H. pardalinus and H. melpomene, groups of females (wild-caught and/or reared) of a given taxon were released into a large cage (2.5 m (W) × 5 m (L) × 2 m (H)) containing single representatives of 21 species of Passiflora that are commonly found near Tarapoto, Peru and which represent potential host plants17. The number of eggs laid on each host plant was recorded at the end of each day. A total of 126 females were tested, resulting in a total of 889 eggs (176 from 35 H. elevatus females, 288 from 24 H. melpomene and 425 from 51 H. pardalinus).
Male sexual preference
To assay male preference for female colour pattern, we presented H. elevatus, H. pardalinus and F2 males (n = 46, 66 and 106, respectively) with a pair of model female wings (one H. elevatus and one H. pardalinus), and recorded courtship events (full details of the experimental set-up are provided in ref. 17). Males were tested individually and placed in the experimental cage one day earlier to allow acclimatization. Trials lasted 15–30 min. The number of courtships (defined as sustained flight 5–15 cm over a model) by the males directed towards each of the model wings was recorded (Supplementary Table 7).
Phenotyping of androconial volatiles
Male Heliconius produce complex chemical blends of volatile compounds from their hindwing androconia. These blends have been shown to function as sex pheromones in several other Heliconius species and in butterflies in general84,85. Androconial regions were excised from 13 H. elevatus, 10 H. pardalinus, 7 H. melpomene malleti individuals and 122 F2 and 17 backcross hybrids 21 days after eclosion, and suspended in dichloromethane. The extracts were analysed by gas chromatography–mass spectrometry (GC–MS), as reported previously16,86 (Supplementary Table 7) on a 7890A GC-System coupled with an MSD 5975C mass analyser (Agilent Technologies) instrument fitted with an HP-5MS column (50 m, 0.25 mm internal diameter, 0.25 µm film thickness). The ionization method was electron impact with a collision energy of 70 eV. Conditions were as follows: inlet pressure 9.79 psi, He 20 ml min−1, injection volume 1 µl. The GC was programmed as follows: start at 50 °C, increase at 5 °C min−1 to 320 °C and hold that temperature for 5 min. The carrier gas was He at 1.2 ml min−1. For all identified compounds, the concentration was calculated from the peak’s area, as reported by AMDIS software87. Each compound’s chromatogram was interpreted by AMDIS through the NIST databases and the additional databases compiled at the Institute of Organic Chemistry of Technische Universität Braunschweig. All identifiable compounds running between undecane and nonacosanal were scored. Potential contaminants or extraneous compounds were excluded, together with compounds that appeared fewer than 10 times across the entire dataset.
DNA extraction and RAD library preparation for QTL analysis
RNA-free genomic DNA was extracted from thoracic tissue using a Qiagen DNeasy Blood and Tissue Kit following manufacturer’s standard protocol. Restriction-site-associated DNA (RAD) libraries were prepared using a protocol modified from (ref. 88, using a PstI restriction enzyme, sixteen 6-bp P1 barcodes and eight indexes. DNA was Covaris sheared to 300–700 bp and gel size selected. A total of 128 individuals were sequenced per lane, with 125-bp paired-end reads, on an Illumina HiSeq 2500 (Supplementary Table 8).
SNP calling
Fastq files from each RAD library were demultiplexed using process_radtags from Stacks89, and BWA-MEM90 was used with default parameters to map the reads to the H. melpomene assembly v.2.5 (ref. 91). BAM files were then sorted and indexed with Samtools (ref. 90), and Picard v.1.119 (https://github.com/broadinstitute/picard) was used to add read group data and mark PCR duplicates. To check for errors, confirm pedigrees and assign samples with unrecorded pedigree to families, we used Plink v.1.9 (ref. 61) to estimate the fraction of the genome that is identical by descent (IBD; \(\widehat{{\boldsymbol{\pi }}}\)) between all pairwise combinations of samples (siblings and parent-offspring comparisons should yield \(\widehat{{\boldsymbol{\pi }}}\) values close to 0.5). In addition, for specimens that were sequenced multiple times, we checked that samples derived from the same individual \((\widehat{{\boldsymbol{\pi }}}\approx 1)\). We then merged these samples, using the MergeSamFiles command from Picard Tools, and used Samtools mpileup command to call SNPs.
Linkage map construction
Linkage maps were built for hybrid and within-species crosses using Lep-MAP3 (ref. 92). Pedigrees are provided in Supplementary Table 8. SNPs were first converted to posterior genotype likelihoods for each of ten possible SNP genotypes. We used the ParentCall2 module to correct erroneous or missing parental genotypes and call sex-linked markers using a log-odds difference of >2 (ZLimit) and halfSibs = 1. We used Filtering2 to remove SNPs showing segregation distortion, specifying a P value limit of 0.01; that is, there is a 1:100 chance that a randomly segregating marker is discarded. We then separated markers into chromosomes using their Hmel2.5 scaffold. To obtain genetic distances between markers, we fixed the order of the markers to their order in Hmel2.5, and then evaluated this order, using all markers and specifying no recombination in females. We then used map2gentypes.awk to convert the Lep-MAP3 output to four-way fully informative genotypes with no missing data. To assign ancestry to phased haplotype blocks in the hybrid linkage map, we used biallelic sites with significantly different allele frequencies in the parental species (χ2 test applied to sequences for 26 H. elevatus and 47 H. pardalinus individuals from Peru and Ecuador).
QTL mapping
The colour pattern, androconial volatiles and wing shape datasets are multivariate and highly collinear. We therefore used PCA to reduce the phenotypic values for the hybrids to orthogonal vectors (PCs), which we then used as phenotypes in QTL mapping. For wing shape, we applied PCA to the Procrustes coordinates. For the androconial volatiles, we applied the PCA to the set of compounds that were significantly different between the two parental species (one-tailed paired t-test). For colour pattern, we performed a PCA on the concatenated RGB values from the aligned images and retained PCs that explained more than 1% of the variance.
For colour pattern, androconial volatiles, wing shape and WBF, we tested for associations between phenotype and genotype using linear models with normal errors. For wing shape, we included centroid size as a covariate to control for allometry. For female host plant choice and male preference for female colour pattern, we (i) logistically transformed the proportions and used linear models with normal errors; and (ii) used generalized linear mixed models with an individual-level random effect to account for overdispersion and binomial errors. The significance of QTL scans was assessed by permuting the phenotypes relative to the genotypes (1,000 permutations). For traits phenotyped in both males and females, a sex-specific significance threshold was used to avoid spurious sex linkage (see Supplementary Table 5).
We first analysed all data using F2s only, using R/qtl (ref. 93) to estimate genotype probabilities at 1-cM intervals, using the Haldane mapping function and an assumed genotyping error rate of 0.001. These genotype probabilities were then used as the dependent variable in models, and for traits phenotyped in both males and females we included sex and cross direction as covariates for markers on the sex chromosome. For traits for which backcrosses had been scored in addition to F2s, we performed an additional round of analyses combining F2s with backcrosses. In this case, we used the categorical genotypes (EE, EP and PP) inferred from linkage mapping as the dependent variables, and added random effects for cross type (three levels: F2, backcross to H. elevatus, backcross to H. pardalinus), sex or individual. Model structures and estimated coefficients are provided in Supplementary Table 5.
To test whether QTLs are significantly clustered (that is, genetically linked), for each QTL we estimated the recombination probability with its nearest neighbouring QTLs (using the position of the maximum LOD score), and took the mean of the resulting vector (low values indicate that most QTLs are linked to at least one other QTL; high values indicate that most QTLs are unlinked). We then randomized the position of the QTLs 10,000 times and compared the observed data to the randomized dataset using a two-tailed test (P = the proportion of randomized datasets that give a result more extreme than the observed data × 2). When multiple QTLs overlapped within the phenotypic classes forewing colour pattern, hindwing colour pattern, forewing shape and hindwing shape, we included only the best supported QTL (highest LOD score). To test whether species and introgression topologies are associated with QTLs, we applied the same test.
To identify putative structural rearrangements between H. elevatus and H. pardalinus, we compared recombination rates between F2s and within-species crosses (F2s, 441 individuals across 26 families; H. elevatus, 179 individuals across 9 families; H. pardalinus, 296 individuals across 15 families). Regions that are freely recombining within species but not in F2s represent candidate rearrangements that might facilitate divergence and speciation. The probability of the within-species recombination events observed within an F2 breakpoint can be given as pn, where p is the fraction of parental individuals in the mapping crosses and n is the observed number of recombination events. We estimated pn within each F2 breakpoint and considered breakpoints in which p < 0.01 to be candidate rearrangements.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
[ad_2]
Source link