Tag: Molecular evolution

Stereochemistry in the disorder–order continuum of protein interactions

Synthetic peptides

Synthetic l- and d-peptides of H1_155–175, PUMA_130–156, ANAC013_254–274, ANAC046_319–338 and DREB2A_255–272 were purchased from Pepscan (now Biosynth) at a minimum purity of 95% and purified by HPLC. The d-peptides contain amino acid residues with a stereoisomeric d-form of each chiral carbon. The peptides were either resuspended in MilliQ H₂O or in MilliQ H₂O containing 50 mM NH₄HCO₃ and lyophilized repeatedly to remove leftover trifluoroacetic acid from the last purification step by the manufacturer. Peptides were then either resuspended directly in the buffer used for experiments or in H₂O without 50 mM NH₄HCO₃ to measure the concentration. If no aromatic residue was present in the peptide sequence, the absorbance at 214 nm was used. The extinction coefficient was calculated using Bestsel⁵¹.

Expression and purification of proteins

¹⁵N-labelled and unlabelled full-length ProTα was expressed and purified as described²². The double-cysteine variant of ProTα (E56C/D110C) used in smFRET experiments was expressed and purified as described²⁴, with some modifications. In brief, ProTα was dialysed against Tris buffer (50 mM Tris, 200 mM NaCl, 2 mM DTT, 1 mM EDTA; pH 8), during which the hexa-histidine tag was cleaved using HRV 3C protease. Cleaved ProTα was purified further using Ni Sepharose Excel resin (Cytiva, formerly GE Healthcare) and a HiPrep Q FF column (Cytiva) with a gradient from 200 mM to 1 M NaCl. The buffer was exchanged to labelling buffer (100 mM potassium phosphate, pH 7) using a HiTrap Desalting column (Cytiva). ¹⁵N-labelled and unlabelled GST–MCL1_152–308 was expressed in BL21(DE3)pLysS Escherichia coli in the presence of ampicillin. Cells were grown at 37 °C in LB or M9 minimal medium (for ¹⁵N labelling) until OD₆₀₀ reached 0.6, then induced with IPTG (1 mM final concentration) and collected after 4 h. The cell pellet was resuspended in Tris buffer (20 mM Tris, 100 mM NaCl; pH 8), then lysed by sonication. After pelleting again, the supernatant was applied to GST Sepharose beads (Cytiva), and GST–MCL1_152–308 was eluted using Tris-GSH buffer (20 mM Tris, 100 mM NaCl, 10 mM GSH; pH 8). The GST tag was removed using TEV protease (0.7 mg) overnight at room temperature. Final purity was reached using a Superdex 75 26/60 column (Cytiva), equilibrated with 50 mM phosphate buffer (pH 7). ¹³C,¹⁵N-labelled MCL1_152–308 was expressed as described⁵² and purified as above. The expression and purification of ¹⁵N-labelled and unlabelled RCD1-RST_499–572 were carried out as previously described²⁸ with the lysis buffer changed to 20 mM Tris-HCl, pH 9.0, 20 mM NaCl. The buffer used in the last purification step by size exclusion chromatography on a Superdex 75 10/300 GL column (Cytiva) was the buffer described for the individual methods.

¹³C,¹⁵N-labelled ANAC046_319–338 or ANAC013_254–274 were expressed with a His₆-SUMO fusion tag in BL21(DE3) E. coli in the presence of kanamycin (50 μg ml⁻¹). Cells were grown in LB at 37 °C until OD₆₀₀ reached 0.6, after which the medium was changed to M9 minimal medium. Expression was induced with IPTG (1 mM final concentration) and cells were collected after incubation overnight at 16 °C. The cells were resuspended in lysis buffer (50 mM and 20 mM Tris-HCl for ANAC046_319–338 and ANAC013_254–274, respectively, pH 8.0, 300 mM NaCl) and sonicated. After centrifugation, the lysate was purified using TALON resin equilibrated in the buffers just described. The fusion peptides were eluted with an equivalent buffer containing 250 mM imidazole. After a dialysis step into 20 mM Tris-HCl pH 8.0, 100 mM NaCl, the fusion tag was cleaved with ubiquitin-like-specific protease 1 (ULP1) (molar ratios between peptide and protease were 1:320 and 1:500 for ANAC046_319–338 and ANAC013_254–274, respectively) overnight at 4 °C. A second purification step with TALON resin was performed, with the peptides recovered in the flowthrough. The purification was finalized by size exclusion chromatography on a Superdex peptide 10/300 GL column (Cytiva), after which the peptides were freeze-dried and resuspended in the desired buffer.

AlphaFold structure modelling

Protein interaction models of RCD1-RST_499–572 in complex with ANAC046_319–338 or ANAC013_254–274 were generated using AlphaFold3³⁰ and analysed in PyMOL (The PyMOL Molecular Graphics System, version 3.0 Schrödinger, LLC.). The five generated models for each complex were assessed manually and compared with the secondary chemical shifts of C^α of the l-ligand recorded using ZZ-exchange or CEST (see NMR spectroscopy method). The structures agreeing with the experimental data were visualized in PyMOL or Chimera X⁵³.

Far-UV CD spectropolarimetry

Far-UV CD spectra of l- and d-peptides of H1_155–175, PUMA_130–156, ANAC013_254–274, ANAC046_319–338, and DREB2A_255–272 were measured on a Jasco 815 spectropolarimeter with a Jasco Peltier control in the range of 260–190 nm at 20 °C. Concentrations of peptides varied between 10 and 30 µM in either MilliQ H₂O, pH 7.0 (PUMA_130–156, H1_155–175) or 20 mM NaH₂PO₄/Na₂HPO₄, pH 7.0 (ANAC013_254–274, ANAC046_319–338, DREB2A_255–272) with 1 mM TCEP in the samples containing ANAC046 peptides. A quartz cuvette with a 1 mm path length was used and 10 scans were recorded and averaged with a scanning speed of 20 nm min⁻¹ and response time of 2 s. A spectrum of the buffer was recorded with identical settings for each sample and subtracted from the sample spectrum.

NMR spectroscopy

All NMR spectra were recorded on Bruker Avance III 600 MHz, 750 MHz or an Avance NEO 800 MHz (for ¹H) spectrometers equipped with cryoprobes. Natural abundance ¹H,¹⁵N and ¹H,¹³C-HSQC spectra were recorded on all peptides at either 10 °C or 25 °C. Peptides (0.5 mM) of ANAC046_319–338, ANAC013_254–274 and DREB2A_255–272 were dissolved in sample buffer containing 20 mM Na₂HPO₄/NaH₂PO₄ pH 7.0, 100 mM NaCl, 10 % (v/v) D₂O, 0.02 % (w/v) NaN₃ and 0.7 mM 4,4-dimethyl-4-silapentane-1-sulfonic acid (DSS), with 1 mM DTT added to the samples containing ANAC046 peptides. ¹H,¹⁵N-HSQC spectra were recorded on 50 µM ProTα, with or without 500 µM l- or d-H1_155–175 in TBSK (ionic strength 165 mM; pH 7.4). ¹H,¹⁵N-HSQC spectra were recorded on 50 µM MCL1, with or without 45 µM l- or 2.5 mM d-PUMA_130–156, in Tris (50 mM; pH 7.0) to compare at 90% saturation, as calculated from K_d values. Assignments of ¹³C,¹⁵N-MCL1 in complex with l-PUMA_130–156 were completed from a series of HNCACB and HNCOCACB 3D spectra as described⁵⁴, and deposited to the Biological Magnetic Resonance Data Bank (BMRB) under accession 52264. ¹H,¹⁵N-HSQC spectra were recorded on 100 µM ¹⁵N-labelled RCD1-RST_499–572 in 20 mM Na₂HPO₄/NaH₂PO₄ pH 7.0, 100 mM NaCl, 10 % (v/v) D₂O, 0.02 % (w/v) NaN₃ and 0.7 mM DSS at 25 °C in the absence and presence of each stereoisomeric form (0–200 µM) of ANAC046_319–338, ANAC013_254–274 and DREB2A_255–272 at the following ratios: 1:0, 1:0.2, 1:0.4, 1:0.6, 1:0.8, 1:1 and 1:2. Assignments of free ProTα and free RCD1-RST were taken from BMRB entries 27215 and 50545, respectively^22,28.

Amide CSPs were calculated from the ¹H,¹⁵N-HSQCs in the absence and presence of the highest concentration of peptide used for each interaction using equation (1):

$${\Delta \delta }_{{\rm{NH}}}\,({\rm{ppm}})=\sqrt{{(\Delta {\delta }^{1}{\rm{H}})}^{2}+{(0.154\times \Delta {\delta }^{15}{\rm{N}})}^{2}}$$

(1)

The total protein CSP (CSP_{total_L}) induced by the binding of the l-enantiomer peptide was quantified by recording the CSPs of all visible ¹⁵N,¹H^N backbone resonances at ≥90% saturation (MCL1: 90%, RST (all cases): >99%, ProTα: >98%). The CSPs for all visible residues were summed to obtain the total CSP. To adjust for residues without data, which include prolines, residues that could not be assigned, and residues not visible in either the bound or unbound state, the total CSP was divided by the fraction of residues for which CSPs were recorded. For instance, if CSPs were obtained for only half of the residues, the calculated total CSP was doubled to estimate the perturbation as if all residues were visible. This adjustment allowed the total CSP to be compared between interactions despite the lack of data from unassigned or invisible residues. The adjustment does not account for the fact that disappearing residues are likely involved in the interaction and thus also likely to experience larger than average CSPs.
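The per-residue calculation of equation (1) and the scaling for missing residues can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the function names, the `n_residues` argument and the example shift changes are assumptions introduced here, and the numbers are not data from the study.

```python
import math

N_SCALE = 0.154  # 15N scaling factor taken from equation (1)

def amide_csp(d_h_ppm, d_n_ppm):
    """Combined amide CSP (ppm) for one residue, following equation (1)."""
    return math.sqrt(d_h_ppm ** 2 + (N_SCALE * d_n_ppm) ** 2)

def total_csp(shift_changes, n_residues):
    """Sum per-residue CSPs and divide by the fraction of residues observed.

    shift_changes: list of (delta 1H, delta 15N) pairs in ppm, one per
        residue for which a CSP could be recorded.
    n_residues: total number of residues in the protein (hypothetical input).
    """
    if not shift_changes:
        raise ValueError("no CSPs recorded")
    observed_fraction = len(shift_changes) / n_residues
    summed = sum(amide_csp(dh, dn) for dh, dn in shift_changes)
    return summed / observed_fraction

# Illustrative values only: CSPs recorded for 3 of 6 residues,
# so the summed CSP is doubled.
example = [(0.03, 0.10), (0.00, 0.50), (0.04, 0.00)]
print(total_csp(example, n_residues=6))
```

Dividing by the observed fraction reproduces the doubling described in the text when only half of the residues have data.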

2D NMR lineshape analysis

2D NMR lineshape analyses were performed for interactions of l- and d-peptides with RCD1-RST_499–572. The recorded ¹H,¹⁵N-HSQC spectra were processed using qMDD with exponential weighting functions with 4 Hz and 8 Hz line broadening in the direct and indirect dimensions, respectively. The 2D lineshape analysis was performed using the tool TITAN³¹ in Matlab (Mathworks) and was based on well-separated spin systems that were easily followed. If the trajectories of spin systems overlapped, the spin systems were grouped during fitting. All titrations were fitted to a two-state binding model, and at least 12 spin systems were picked for each analysis. Owing to initially poor fits for the titration of ¹⁵N-RCD1-RST_499–572 with l-ANAC013_254–274, the K_d value was fixed to the value determined by ITC. Errors were determined by a bootstrap analysis using 100 replicas to determine the standard error of the mean. From the lineshape analysis, the fitted K_d and k_off values were used to calculate the association rate constant (k_on) based on equation (2):

$${K}_{{\rm{d}}}=\frac{{k}_{{\rm{off}}}}{{k}_{{\rm{on}}}}$$

(2)

The differences in activation free energies for binding between d- and l-peptides were estimated from the ratios of the association rate constants for both stereoisomers, ${k}_{{\rm{on}}}^{{\rm{D}}}$ and ${k}_{{\rm{on}}}^{{\rm{L}}}$, based on equation (3):

$${\Delta \Delta G}_{{\rm{unbound}}-\ddagger ,{\rm{D}}-{\rm{L}}}=RT{\rm{ln}}\left(\frac{{k}_{{\rm{on}}}^{{\rm{L}}}}{{k}_{{\rm{on}}}^{{\rm{D}}}}\right),$$

(3)

which was rewritten from Fersht (equation 18.22 in ref. ⁵⁵).
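As a worked illustration of equations (2) and (3), the sketch below derives k_on from K_d and k_off and converts the ratio of association rate constants into a difference in activation free energy. The function names and all numerical inputs are hypothetical, and the temperature default of 298.15 K assumes the 25 °C used for the titrations.

```python
import math

R = 8.314462618  # gas constant, J mol^-1 K^-1

def k_on(k_off, k_d):
    """Association rate constant from equation (2): K_d = k_off / k_on."""
    return k_off / k_d

def ddg_activation(k_on_l, k_on_d, temperature_k=298.15):
    """Equation (3): RT ln(k_on^L / k_on^D), returned in kJ mol^-1."""
    return R * temperature_k * math.log(k_on_l / k_on_d) / 1000.0

# Hypothetical values chosen only to illustrate the arithmetic
# (k_off in s^-1, K_d in M, so k_on is in M^-1 s^-1).
kon_l = k_on(k_off=100.0, k_d=1e-6)
kon_d = k_on(k_off=100.0, k_d=1e-5)
print(kon_l, kon_d, ddg_activation(kon_l, kon_d))
```

A positive value means that the d-peptide associates more slowly than the l-peptide, which corresponds to a higher barrier from the unbound state to the transition state.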

CEST NMR

CEST experiments were recorded for the l-peptide of ANAC046_319–338 to determine the chemical shift of its bound state with RCD1-RST_499–572. All experiments were recorded on a Bruker Avance Neo 800 spectrometer with a cryoprobe. A sample of 1 mM ¹³C,¹⁵N-labelled l-ANAC046_319–338 was prepared with 5% molar ratio of RCD1-RST_499–572 in 20 mM Na₂HPO₄/NaH₂PO₄ pH 6.5, 100 mM NaCl, 10 % (v/v) D₂O, 0.02 % (w/v) NaN₃, 0.7 mM DSS and 5 mM DTT. ¹⁵N-CEST data were acquired using pulse sequences as previously described⁵⁶ at 25 °C using three different B₁ field strengths: 6.25, 12.5 and 25 Hz. ¹³C-CEST data were acquired using special pulse sequences^57,58 (provided by L. Kay) as done in ref. ⁵⁹ at 25 °C with a B₁ field strength of 25 Hz. The free induction decays were transformed using NMRPipe⁶⁰ and peak intensities were extracted from each specific peak position. The intensities were analysed using ChemEx⁶¹ by fitting to a global two-state model implemented in the program. The fits reported on the change in chemical shift for peaks experiencing CEST transfer, which directly reflects the chemical shift of the bound state of the peptide. The chemical shifts were extracted for the C^α and compared to a reference set⁶².

ZZ-exchange

For the complex between RST and ¹⁵N-ANAC013_254–274, identification of residues and their assignments were resolved by 3D heteronuclear NMR experiments with additional ZZ-exchange⁶³ NMR spectra recorded on a 50% saturated sample of 100 µM ¹³C, ¹⁵N-ANAC013_254–274 with 50 µM RCD1-RST_499–572 in 20 mM Na₂HPO₄/NaH₂PO₄ pH 6.5, 200 mM NaCl, 10 % (v/v) D₂O, 0.02 % (w/v) NaN₃, and 0.7 mM DSS. The ZZ-exchange connections made it possible to manually track the assignment from the ¹H,¹⁵N-HSQC spectrum of the unbound ¹⁵N-ANAC013_254–274 to the RST-bound ¹⁵N-ANAC013_254–274. For the assignments of carbon resonances of ANAC013, two samples were prepared: ¹³C, ¹⁵N-ANAC013_254–274 (650 µM) with or without RCD1-RST_499–572 (800 µM) in 20 mM Na₂HPO₄/NaH₂PO₄ pH 6.5, 200 mM NaCl, 10 % (v/v) D₂O, 0.02 % (w/v) NaN₃, and 0.7 mM DSS. Backbone resonances for the unbound peptide were manually assigned from analysis of ¹⁵N-HSQC, HNCA, HNCO and HNCACB experiments. All NMR spectra were acquired at 25 °C on a Bruker Avance III 750 MHz spectrometer, except for the ZZ-exchange spectra, which were acquired on a Bruker Avance III 600 MHz spectrometer. All 3D experiments were recorded using non-uniform sampling.

Secondary chemical shifts

SCSs were calculated using the POTENCI⁶² web tool.

Transverse relaxation

To determine the dynamics of l-ANAC046_319–338 and l-ANAC013_254–274 with and without RCD1-RST_499–572, the sample from the ANAC013_254–274 assignment was reused, whereas a new sample was made for ANAC046_319–338: 75 µM ¹³C, ¹⁵N-ANAC046_319–338 with 180 µM RCD1-RST_499–572 in 20 mM Na₂HPO₄/NaH₂PO₄ pH 6.5, 100 mM NaCl, 10 % (v/v) D₂O, 0.02 % (w/v) NaN₃, 0.7 mM DSS and 5 mM DTT. The transverse relaxation rates (R₂) were acquired on a Bruker Avance Neo 800 spectrometer with the following relaxation delays: 33.8 ms, 67.6 ms, 101.4 ms, 169.0 ms, 236.6 ms, 270.4 ms, 338.0 ms and 405.6 ms (all in triplicate), and a recycle delay of 2 s. Data were fitted to a one-phase decay function.
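A one-phase decay, I(t) = I₀·exp(−R₂·t), can be fitted in several ways; the fitting software used in the study is not stated. The sketch below uses a log-linear least-squares regression as a simple stand-in for a nonlinear fit, with the relaxation delays from the text and synthetic, noise-free intensities generated for a hypothetical R₂ of 8 s⁻¹.

```python
import math

def fit_one_phase_decay(delays_s, intensities):
    """Estimate I0 and R2 for I(t) = I0 * exp(-R2 * t).

    Uses ordinary least squares on ln(I) versus t, which is exact for
    noise-free data and only an approximation of a nonlinear fit otherwise.
    """
    n = len(delays_s)
    logs = [math.log(i) for i in intensities]
    mean_t = sum(delays_s) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in delays_s)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(delays_s, logs))
    slope = sxy / sxx
    r2 = -slope                                # decay rate in s^-1
    i0 = math.exp(mean_y - slope * mean_t)     # intensity at t = 0
    return i0, r2

# Relaxation delays from the text, converted from ms to s.
delays = [0.0338, 0.0676, 0.1014, 0.1690, 0.2366, 0.2704, 0.3380, 0.4056]
# Synthetic peak intensities (assumption, not measured data).
synthetic = [1000.0 * math.exp(-8.0 * t) for t in delays]
i0, r2 = fit_one_phase_decay(delays, synthetic)
print(i0, r2)
```

With real, noisy triplicates a weighted nonlinear fit would usually be preferred, because the log transform distorts the error at long delays.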

Isothermal titration calorimetry

Prior to ITC, all samples were spun down at 17,000g for 10 min at the experimental temperature. ITC experiments involving ProTα and MCL1_152–308 as interaction partners were recorded on a MicroCal PEAQ-ITC microcalorimeter (Malvern Panalytical). ProTα (7.1 µM) was placed in the cell and either l- or d-H1_155–175 (99.1 µM) in the syringe, in TBSK (165 mM ionic strength) at 20 °C. Each injection was 2 µl, with a total of 19 injections at an interval of 150 s between each. Data were fit using a fixed number of binding sites (fixed to one) so that fits could be standardized. For the MCL1_152–308 interactions, MCL1_152–308 (10 µM) was placed in the cell, with either l- or d-PUMA_130–156 (100 µM) in the syringe, in Tris (50 mM; pH 7.0) at 25 °C. Each of the 35 injections was 1 µl, with an interval of 150 s between each. The experiment was repeated for MCL1:d-PUMA_130–156, increasing the concentrations to 70 and 700 µM, respectively, while keeping the remaining experimental conditions identical. ITC experiments involving RCD1-RST_499–572 as interaction partner were recorded on a MicroCal ITC₂₀₀ microcalorimeter (MicroCal Instruments) at 25 °C in 50 mM Na₂HPO₄/NaH₂PO₄ pH 7.0, 100 mM NaCl. TCEP (1 mM) was added to the sample buffer for interactions involving ANAC046 peptides. Concentrations of RCD1-RST_499–572 varied between 10 and 100 µM in the cell, and concentrations of the ANAC046, ANAC013 or DREB2A peptides varied between 100 and 1,000 µM in the syringe. The first injection was 0.5 µl followed by 18 repetitions of 2 µl injections separated by 180 s. These experiments were processed using the Origin7 software package supplied by the manufacturer. The last 18 injections of each experiment were fitted to a one-set-of-sites binding model. Triplicates were recorded for each interaction.

A salt titration was performed measuring the interaction between RCD1-RST_499–572 and the l-peptides of ANAC046_319–338 and ANAC013_254–274 by ITC, varying the NaCl concentration in the experimental buffer. Experiments were recorded on a MicroCal PEAQ-ITC microcalorimeter or a MicroCal ITC₂₀₀ microcalorimeter at 25 °C. A 50 mM Na₂HPO₄/NaH₂PO₄ pH 7.0, 1 mM TCEP buffer was used with NaCl concentrations of 0, 50, 150 and 200 mM; data at 100 mM NaCl had been recorded previously and were included in the analysis. Protein and peptide concentrations varied from 10–30 µM in the cell (RCD1-RST) and 100–300 µM in the syringe (peptides). A replicate of each experiment was recorded, and the isotherms were fitted as described above.

Fluorophore labelling for smFRET

ProTα was labelled by incubating it with Alexa Fluor 488 (0.7:1 dye to protein molar ratio) for 1 h at room temperature and sequentially with Alexa Fluor 594 (1.5:1 dye to protein molar ratio) overnight at 4 °C. Labelled protein was purified using a HiTrap Desalting column and reversed-phase high-performance liquid chromatography (RP-HPLC) on a SunFire C18 column (Waters Corporation) with an elution gradient from 20% acetonitrile and 0.1% trifluoroacetic acid in aqueous solution to 37% acetonitrile. ProTα-containing fractions were lyophilized and dissolved in buffer (10 mM Tris, 200 mM KCl, 1 mM EDTA; pH 7.4).

Single-molecule FRET measurements and analysis

Single-molecule fluorescence experiments were conducted using either a custom-built confocal microscope or a MicroTime 200 confocal microscope (PicoQuant) equipped with a 485-nm diode laser and an Olympus UplanApo 60×/1.20 W objective. Microscope and filter setup were used as previously described²⁴. The 485-nm diode laser was set to an average power of 100 μW (measured at the back aperture of the objective), either in continuous-wave or pulsed mode with alternating excitation of the dyes, achieved using pulsed interleaved excitation (PIE)⁶⁴. The wavelength range used for acceptor excitation in PIE mode was selected with a z582/15 band pass filter (Chroma) from the emission of a supercontinuum laser (EXW-12 SuperK Extreme, NKT Photonics) driven at 20 MHz, which triggers interleaved pulses from the 485-nm diode laser used for donor excitation. In our experiments, photon bursts (at least 3000 bursts) were selected against the background mean fluorescence counts and, in case of PIE, by having a stoichiometry ratio S of $0.2 < S < 0.75$, each originating from an individual molecule diffusing through the confocal volume. Transfer efficiencies were quantified according to $E={n}_{{\rm{A}}}/({n}_{{\rm{A}}}+{n}_{{\rm{D}}})$, where ${n}_{{\rm{D}}}$ and ${n}_{{\rm{A}}}$ are the numbers of donor and acceptor photons in each burst, respectively, corrected for background, channel crosstalk, acceptor direct excitation, differences in quantum yields of the dyes, and detection efficiencies. All smFRET experiments were performed in µ-Slide sample chambers (Ibidi) at 22 °C in TEK buffer with an ionic strength of 165 mM fixed with KCl; 140 mM 2-mercaptoethanol and 0.01% (v/v) Tween-20 were added for photoprotection and for minimizing surface adhesion, respectively. Single-molecule data were analysed using the Mathematica (Wolfram Research) package Fretica (https://schuler.bioc.uzh.ch/programs). 
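The burst selection and the transfer-efficiency formula above can be illustrated with a short sketch. It assumes that photon counts have already been corrected as described and that a stoichiometry ratio S has already been computed for each burst; the dictionary layout, function names and burst values are hypothetical and are not part of Fretica.

```python
def transfer_efficiency(n_a, n_d):
    """E = n_A / (n_A + n_D) for corrected photon counts in one burst."""
    return n_a / (n_a + n_d)

def select_bursts(bursts, s_min=0.2, s_max=0.75):
    """Keep bursts whose stoichiometry ratio S satisfies s_min < S < s_max."""
    return [b for b in bursts if s_min < b["S"] < s_max]

# Three made-up bursts: the second resembles a donor-only molecule
# (high S) and is rejected by the stoichiometry window.
bursts = [
    {"n_d": 40.0, "n_a": 60.0, "S": 0.5},
    {"n_d": 90.0, "n_a": 10.0, "S": 0.95},
    {"n_d": 30.0, "n_a": 30.0, "S": 0.4},
]
kept = select_bursts(bursts)
efficiencies = [transfer_efficiency(b["n_a"], b["n_d"]) for b in kept]
print(efficiencies)  # [0.6, 0.5]
```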

For quantifying binding affinities, transfer efficiency histograms were constructed from single-molecule photon bursts identified as described above. Each histogram was normalized to an area of 1 and fit with a Gaussian peak function to extract its mean transfer efficiency $\langle E\rangle $. The mean transfer efficiency as a function of increasing concentration of d/l-H1_155–175, $\langle E\rangle ({C}_{{\rm{D/L-H1}}})$, was fit with:

$$\langle E\rangle ({C}_{{\rm{D/L-H1}}}^{{\rm{tot}}})=\Delta {\langle E\rangle }^{{\rm{sat}}}\times \frac{{C}_{{\rm{D/L-H1}}}^{{\rm{tot}}}+{K}_{{\rm{d}}}+{C}_{{\rm{ProT\alpha }}}^{{\rm{tot}}}-\sqrt{{({C}_{{\rm{D/L-H1}}}^{{\rm{tot}}}+{K}_{{\rm{d}}}+{C}_{{\rm{ProT\alpha }}}^{{\rm{tot}}})}^{2}-4{C}_{{\rm{D/L-H1}}}^{{\rm{tot}}}{C}_{{\rm{ProT\alpha }}}^{{\rm{tot}}}}}{2{C}_{{\rm{ProT\alpha }}}^{{\rm{tot}}}}+{\langle E\rangle }_{0}$$

(4)

Here, ${C}_{{\rm{D/L-H1}}}^{{\rm{tot}}}$ and ${C}_{{\rm{ProT\alpha }}}^{{\rm{tot}}}$ are the total concentration of d/l-H1_155–175 and ProTα, respectively, ${\langle E\rangle }_{0}$ is the mean transfer efficiency of free ProTα, and ${\Delta \langle E\rangle }^{{\rm{sat}}}$ is the increase in transfer efficiency from free ProTα to ProTα saturated with d/l-H1_155–175, while ${K}_{{\rm{d}}}$ is the equilibrium dissociation constant.
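Equation (4) is the standard 1:1 binding isotherm that accounts for ligand depletion, scaled by the change in transfer efficiency. A minimal sketch of the model function is given below; the function and argument names are chosen here for illustration, and the concentrations in the example are hypothetical rather than values from the study.

```python
import math

def mean_transfer_efficiency(c_ligand, c_protein, k_d, e0, delta_e_sat):
    """Mean transfer efficiency according to equation (4).

    c_ligand: total d/l-H1 concentration
    c_protein: total ProTα concentration (must be > 0)
    k_d: equilibrium dissociation constant
    e0: mean transfer efficiency of free ProTα
    delta_e_sat: change in transfer efficiency at saturation
    All concentrations must share one unit.
    """
    b = c_ligand + k_d + c_protein
    bound_fraction = (b - math.sqrt(b * b - 4.0 * c_ligand * c_protein)) / (2.0 * c_protein)
    return delta_e_sat * bound_fraction + e0

# Hypothetical titration: 0.05 nM labelled protein, K_d of 1 nM (values in nM).
for c in [0.0, 0.1, 1.0, 10.0, 100.0]:
    print(c, mean_transfer_efficiency(c, 0.05, 1.0, e0=0.40, delta_e_sat=0.25))
```

Passing this function to any least-squares routine with K_d, ⟨E⟩₀ and Δ⟨E⟩^sat as free parameters would reproduce the kind of fit described in the text.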

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

November 27, 2024
Exaptation of ancestral cell-identity networks enables C4 photosynthesis

Plant growth

For the de-etiolation time course, seeds of Oryza sativa spp. japonica cultivar Kitaake and Sorghum bicolor BTx623 were incubated in sterile water for two days and one day, respectively, at 29 °C in the dark. Germinated seedlings were transferred in a dark room equipped with green light to a 1:1 mixture of topsoil and sand supplemented with fertilizer granules and grown for five days in the dark by wrapping the tray and lid several times with aluminium foil. Plants were placed in a controlled environment room with 60% humidity and temperatures of 28 °C and 20 °C during the day and night, respectively. Plants were exposed to light at the beginning of a photoperiod of 12 h light and 12 h dark and shoots were harvested at different time points during de-etiolation by flash-freezing tissue in liquid nitrogen. For the 0-h time point, seedlings were harvested in a dark room equipped with green light and flash-frozen immediately.

For microscopy analysis and enrichment of bundle-sheath nuclei using fluorescence-activated nuclei sorting, O. sativa spp. japonica cultivar Kitaake single-copy homozygous T2 seeds were de-husked and sterilized in 10% (v/v) bleach for 30 min. After washing several times with sterile water, seeds were incubated for two days in sterile water at 29 °C in the dark. Germinated seedlings were transferred to half-strength Murashige and Skoog medium with 0.8% agar in Magentas and grown for five days in the light in a growth chamber at temperatures of 28 °C and 20 °C during the day and night, respectively, and a photoperiod of 12 h light and 12 h dark.

Construct design and cloning

To generate constructs for the rice bundle-sheath marker line, the coding sequence for mTurquoise2 was obtained from a previous report⁴³, and the promoter sequence from Zoysia japonica PHOSPHOENOLPYRUVATE CARBOXYKINASE in combination with the dTALE STAP4 system was obtained from a previous report⁴⁴. The coding sequence of Arabidopsis thaliana H2B (At5g22880) was used as an N-terminal signal for targeting mTurquoise2 to the nucleus. All sequences were domesticated for Golden Gate cloning^45,46. Level 1 and Level 2 constructs were assembled using the Golden Gate cloning strategy to create a binary vector for the expression of STAP4-mTurquoise2-H2B driven by PCK-dTALE.

For the transactivation assay in rice protoplasts, transcription factor coding sequences were amplified using rice leaf cDNA or synthesized using GeneArt after domesticating the sequences for Golden Gate cloning^41,42 (OsDOF2, LOC_Os01g15900, OsDOF8, LOC_Os02g45200, OsDOF23, LOC_Os07g32510, OsDOF27, LOC_Os10g26620, SbDOF2, Sobic.003G121400, SbDOF8, Sobic.004G284400, SbDOF11, Sobic.001G489900 and SbDOF17, Sobic.006G182300). The coding sequences were assembled into a Level 1 module with a Zea mays UBI promoter and Tnos terminator module as described previously³⁷. For the minimal SIR promoter, nucleotides –980 to –829, as well as the endogenous core promoter (nucleotides –250 to +42), were fused with the LUCIFERASE reporter to measure transcription activity³⁷.

To generate GUS reporter rice lines, the minimal SIR promoter was assembled into a Level 1 module with the coding sequence for kzGUS (an intronless version of the GUS reporter gene) and the Tnos terminator as described previously³⁷. The DOF motifs in the minimal SIR promoter were mutated using PCR amplification.

Rice transformation

Oryza sativa spp. japonica cultivar Kitaake was transformed using Agrobacterium tumefaciens as described previously⁴⁷, with several modifications. Seeds were de-husked and sterilized with 10% (v/v) bleach for 15 min before placing them on nutrient broth (NB) callus induction medium containing 2 mg l⁻¹ 2,4-dichlorophenoxyacetic acid for four weeks at 28 °C in the dark. Growing calli were co-incubated with A. tumefaciens strain LBA4404 carrying the expression plasmid of interest in NB inoculation medium containing 40 μg ml⁻¹ acetosyringone for three days at 22 °C in the dark. Calli were transferred to NB recovery medium containing 300 mg l⁻¹ timentin for one week at 28 °C in the dark. They were then transferred to NB selection medium containing 35 mg l⁻¹ hygromycin B for four weeks at 28 °C in the dark. Proliferating calli were subsequently transferred to NB regeneration medium containing 100 mg l⁻¹ myo-inositol, 2 mg l⁻¹ kinetin, 0.2 mg l⁻¹ 1-naphthaleneacetic acid and 0.8 mg l⁻¹ 6-benzylaminopurine for four weeks at 28 °C in the light. Plantlets were transferred to NB rooting medium containing 0.1 mg l⁻¹ 1-naphthaleneacetic acid and incubated in Magenta pots for two weeks at 28 °C in the light. Finally, plants were transferred to a 1:1 mixture of topsoil and sand and grown in a controlled environment room with 60% humidity, temperatures of 28 °C and 20 °C during the day and night, respectively, and a photoperiod of 12 h light and 12 h dark.

Transactivation assay

Rice leaf protoplast isolation was performed as described previously^37,48. Protoplasts were transformed using Golden Gate Level 1 modules designed for constitutive expression of transcription factors, alongside the LUC reporter and the ZmUBIpro::GUS-Tnos transformation control, which were prepared with the ZymoPURE II Plasmid Midiprep Kit. The transformation mixture contained 2 µg of control plasmids, 5 µg of reporter plasmids and 5 µg of transcription factor plasmids, which were transformed into 180 µl of protoplasts. After incubating protoplasts for 20 h in the light, proteins were extracted using passive lysis buffer (Promega), and GUS activity was measured with 20 µl of the protein extract. A fluorometric MUG (4-methylumbelliferyl-β-d-glucuronide) assay was used for quantifying GUS activity⁴⁹ in a reaction mixture of 200 µl containing 50 mM phosphate buffer (pH 7.0), 10 mM EDTA-Na₂, 0.1% (v/v) Triton X-100, 0.1% (w/v) N-lauroylsarcosine sodium, 10 mM DTT and 2 mM MUG. The assay was performed at 37 °C, and 4-methylumbelliferone (4-MU) fluorescence was recorded every 2 min for 20 cycles at 360 nm excitation and 450 nm emission using a CLARIOstar plate reader. In addition, LUC activity was determined using 20 µl of protein sample and 100 µl of LUC assay reagent from Promega. Transcription activity was quantified as LUC luminescence relative to the rate of MU accumulation per second.
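The normalization in the last sentence can be sketched as follows: the rate of MU accumulation is taken as the slope of 4-MU fluorescence against time, and the LUC signal is divided by that slope. This is an illustration under assumptions; the function names are hypothetical, the fluorescence trace is synthetic, and a simple least-squares slope is assumed rather than confirmed as the authors' procedure.

```python
def linear_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def relative_transcription_activity(luc_signal, times_s, mu_fluorescence):
    """LUC luminescence divided by the MU accumulation rate (per second)."""
    return luc_signal / linear_slope(times_s, mu_fluorescence)

# 20 reads, one every 2 min, as in the text; signal values are synthetic.
times_s = [120.0 * cycle for cycle in range(20)]
mu_fluorescence = [500.0 + 2.5 * t for t in times_s]
activity = relative_transcription_activity(10000.0, times_s, mu_fluorescence)
print(activity)
```

Dividing by the GUS-derived rate corrects each well for differences in transformation efficiency, since GUS is expressed from the co-transformed control plasmid.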

GUS staining

GUS staining was performed as described previously⁴⁹, with minor modifications. Leaf tissue was fixed in 90% (v/v) acetone for 12 h at 4 °C. After washing with 100 mM phosphate buffer (pH 7.0), samples were transferred into 1 mg ml⁻¹ 5-bromo-4-chloro-3-indolyl glucuronide (X-Gluc) GUS staining solution and vacuum was applied five times for 2 min each. The samples were incubated at 37 °C for 48 h. To clear chlorophyll, samples were incubated in 90% (v/v) ethanol at room temperature. Cross-sections were prepared with a razor blade and images were taken with an Olympus BX41 light microscope.

To quantify GUS activity, a fluorometric MUG assay was used⁴⁹ as described above, using 200 mg of mature leaf tissue. A standard curve of ten 4-MU standards was used to determine the 4-MU concentration in each sample.

Confocal microscopy

To test the bundle-sheath-specific expression of mTurquoise2-H2B, recently expanded leaf 3 of seven-day-old seedlings was prepared for confocal microscopy by scraping the adaxial side of the leaf blade two to three times with a sharp razor blade, transferring to water to avoid drying out and then mounting on a microscope slide with the scraped surface facing upwards. Confocal imaging was performed on a Leica TCS SP8 X using a 10× air objective (HC PL APO CS2 10×0.4 Dry) with optical zoom, and hybrid detectors for fluorescent protein and chlorophyll autofluorescence detection. The following excitation (Ex) and emission (Em) wavelengths were used for imaging: mTurquoise2 (Ex = 442, Em = 471–481), chlorophyll autofluorescence (Ex = 488, Em = 672–692).

SEM

For the de-etiolation experiment of rice and sorghum, samples from four to six individual seedlings for each time point (0 h, 6 h, 12 h and 48 h) were collected for electron microscopy. Leaf segments (around 2 mm²) were excised with a razor blade and immediately fixed in 2% (v/v) glutaraldehyde and 2% (w/v) formaldehyde in 0.05–0.1 M sodium cacodylate (NaCac) buffer (pH 7.4) containing 2 mM calcium chloride. Samples were vacuum infiltrated overnight, washed five times in 0.05–0.1 M NaCac buffer and post-fixed in 1% (v/v) aqueous osmium tetroxide, 1.5% (w/v) potassium ferricyanide in 0.05 M NaCac buffer for three days at 4 °C. After osmication, samples were washed five times in deionized water and post-fixed in 0.1% (w/v) thiocarbohydrazide for 20 min at room temperature in the dark. Samples were then washed five times in deionized water and osmicated for a second time for 1 h in 2% (v/v) aqueous osmium tetroxide at room temperature. Samples were washed five times in deionized water and subsequently stained in 2% (w/v) uranyl acetate in 0.05 M maleate buffer (pH 5.5) for three days at 4 °C and washed five times afterwards in deionized water. Samples were then dehydrated in an ethanol series, and transferred to acetone and then to acetonitrile. Leaf samples were embedded in Quetol 651 resin mix (TAAB Laboratories Equipment) and cured at 60 °C for two days. Ultra-thin sections of embedded leaf samples were prepared and placed on Melinex (TAAB Laboratories Equipment) plastic coverslips mounted on aluminium SEM stubs using conductive carbon tabs (TAAB Laboratories Equipment), sputter-coated with a thin layer of carbon (around 30 nm) to avoid charging, and imaged in a Verios 460 scanning electron microscope at a 4 keV accelerating voltage and 0.2 nA probe current using the concentric backscatter detector in field-free (low-magnification) or immersion (high-magnification) mode (working distance 3.5–4 mm, dwell time 3 µs, 1,536 × 1,024 pixel resolution). 

For observing plastid ultrastructure, SEM stitched maps were acquired at 10,000× magnification using the FEI MAPS automated acquisition software. Greyscale contrast of the images was inverted to allow easier visualization.

Enrichment of bundle-sheath nuclei using fluorescence-activated cell sorting

To purify the nuclei population from whole leaves, recently expanded leaves 3 from five seven-day-old wild-type rice seedlings were chopped on ice in nuclei buffer (10 mM Tris-HCl, pH 7.4, 10 mM NaCl, 3 mM MgCl₂, 0.5 mM spermidine, 0.2 mM spermine, 0.01% Triton X, 1× Roche complete protease inhibitors, 1% BSA and Protector RNase inhibitor) with a sharp razor blade. The suspension was filtered through a 70-μm filter and subsequently through a 35-μm filter. Nuclei were stained with Hoechst and purified by fluorescence-activated cell sorting (FACS) on an AriaIII instrument, using a 70-μm nozzle. Nuclei were collected in an Eppendorf tube containing BSA and Protector RNase inhibitor. Using the same approach, nuclei from the bundle-sheath marker line expressing mTurquoise2-H2B were isolated. Nuclei were sorted on the basis of the mTurquoise2 fluorescent signal. Nuclei were collected in minimal nuclei buffer (10 mM Tris-HCl, pH 7.4, 10 mM NaCl, 3 mM MgCl₂, RNase inhibitor and 0.05% BSA). After collection, nuclei were spun down in a swinging bucket centrifuge at 405g for 5 min, with reduced acceleration and deceleration. Nuclei were resuspended in minimal nuclei buffer and mixed with the unspun whole leaf nuclei population to achieve a proportion of approximately 25% mTurquoise2-positive nuclei. The bundle-sheath enriched nuclei population was processed using the 10X Genomics Gene Expression platform with v.3.1 chemistry, and sequenced on the Illumina NovaSeq 6000 with 150-bp paired-end chemistry.

Chlorophyll quantification

Seedlings were harvested at specified time points during de-etiolation and immediately flash-frozen in liquid nitrogen. Frozen tissue was ground into fine powder and the weight was measured before suspending the tissue in 1 ml of 80% (v/v) acetone. After vortexing, the tissue was incubated on ice for 15 min with occasional mixing of the suspension. The tissue was spun down at 15,700g at 4 °C and the supernatant was removed. The extraction was repeated, and supernatants were pooled before measuring the absorbance at 663.6 nm and 646.6 nm in a spectrophotometer. The total chlorophyll content was determined as described previously⁵⁰.
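
The calculation behind these two absorbance readings can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes the widely used Porra et al. (1989) simultaneous equations for 80% acetone extracts read in a 1-cm cuvette, which use the same 663.6 nm and 646.6 nm wavelengths. Whether the cited reference uses exactly these coefficients is an assumption, and the function names, the pooled extract volume and the tissue weight are hypothetical.

```python
def chlorophyll_ug_per_ml(a663_6: float, a646_6: float) -> dict:
    """Chlorophyll concentrations (µg per ml extract) from two absorbances.

    Coefficients are the Porra et al. (1989) values for 80% acetone and a
    1-cm path length (an assumption; the source only cites a prior method).
    """
    chl_a = 12.25 * a663_6 - 2.55 * a646_6
    chl_b = 20.31 * a646_6 - 4.91 * a663_6
    return {"chl_a": chl_a, "chl_b": chl_b, "total": chl_a + chl_b}


def chlorophyll_per_mg_tissue(a663_6: float, a646_6: float,
                              extract_volume_ml: float,
                              tissue_weight_mg: float) -> float:
    """Total chlorophyll normalized to tissue weight (µg per mg)."""
    total = chlorophyll_ug_per_ml(a663_6, a646_6)["total"]
    return total * extract_volume_ml / tissue_weight_mg


# Hypothetical readings: two pooled 1-ml extractions from 20 mg of tissue.
example = chlorophyll_per_mg_tissue(1.0, 0.5, extract_volume_ml=2.0,
                                    tissue_weight_mg=20.0)
```

Normalizing to the measured tissue weight is what allows time points with different amounts of harvested material to be compared.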

Nuclei extraction and single-nucleus RNA-seq (10X RNA-seq)

Frozen tissue from each time point (one biological replicate per time point, eight time points) was crushed using a bead bashing approach, and nuclei were released from homogenate by resuspending in nuclei buffer (10 mM Tris-HCl, pH 7.4, 10 mM NaCl and 3 mM MgCl₂). The resulting suspension was passed through a 30-μm filter. To enrich the filtered solution for nuclei, an Optiprep (Sigma) gradient was used. Enriched nuclei were then stained with Hoechst, before being FACS purified (BD Influx Software v.1.2.0.142). Purified nuclei were run on the 10X Gene Expression platform with v.3.0 chemistry, and sequenced on the Illumina NovaSeq 6000 with 150-bp paired-end chemistry. Single-cell libraries were made following the manufacturer’s protocol. Libraries were sequenced to an average saturation of 63% (14% s.d.) and aligned either to the rice (O. sativa, subspecies Nipponbare; MSU annotation)⁵¹ or sorghum (S. bicolor v.3.0.1; JGI annotation)⁵² genome. Chloroplast and mitochondrial reads were removed. For each time point, an average of 12,524 nuclei were sequenced (6,405 s.d.), with an average median unique molecular identifier (UMI) count of 1,152 (420 s.d.) across both species. Doublets were removed using DoubletFinder⁵³.

Nuclei extraction and single-nucleus RNA-seq (sci-RNA-seq3)

Each individual frozen seedling (10–12 individual seedlings per time point) was crushed using a bead bashing approach in a 96-well plate, after which homogenate was resuspended in nuclei buffer. Resulting suspensions were passed through a 30-μm filter. Washed nuclei were then reverse-transcribed with a well-specific primer. After this step, remaining pool and split steps for sci-RNA-seq3 were followed as outlined previously²⁶. We note the same approach was used to sequence the 48-h time point; however, a population of six plants was used instead of individual seedlings. Libraries were sequenced to an average saturation of 80% (5% s.d.), and sequenced on the Illumina NovaSeq 6000 with 150-bp paired-end chemistry. Reads were aligned to either the rice or the sorghum genome, as described above. Chloroplast and mitochondrial reads were removed. For 0–12-h time points, an average of 6,527 nuclei were sequenced (5,039 s.d.), with an average median UMI of 423 (41 s.d.) across both species. For the 48-h time point, 77,208 and 82,748 nuclei were sequenced with a median UMI of 757 and 740 for rice and sorghum, respectively.

Nuclei extraction and single-nucleus RNA-seq (10X Multiome)

Fresh seedling tissue was collected after 0 or 12 h light treatment (two biological replicates per species, each with two to four technical replicates per time point; n = 11). Fresh tissue was chopped finely on ice in green room conditions in nuclei buffer. The resulting homogenate was filtered using a 30-μm filter. Nuclei were enriched using Optiprep gradient. No FACS was performed. Nuclei were run on the 10X Multiome platform with v.1.0 chemistry. Single-cell libraries were made following the manufacturer’s protocol, and sequenced on the Illumina NovaSeq 6000 with 150-bp paired-end chemistry. Reads were aligned to either the rice or the sorghum genome, as described above. Chloroplast and mitochondrial reads were removed. For each sample, an average of 1,923 nuclei were sequenced (1,334 s.d.), with an average median UMI of 1,644 (646 s.d.) and median ATAC fragments 10,251 (7,001 s.d.) across both species.

Nuclei clustering

Transcriptional atlases were generated separately for each species using Seurat⁵⁴. Nuclei were first aggregated across various time points (ranging from 0 to 48 h) and methods (10X and sci-RNA-seq3). The integrated dataset was subjected to clustering, using the top 2,000 variable features that were shared across all datasets. Each cluster contained nuclei sampled from all time points, indicating that clustering was driven predominantly by cell type rather than by time after exposure to light (Extended Data Fig. 2). Subsequent UMAP projections were constructed using the first 30 principal components. UMAP projections of mesophyll and bundle-sheath sub-clusters in rice and sorghum, respectively, were achieved using genes found to be significantly differentially expressed in response to light as variable features. To analyse the rice bundle-sheath-specific mTurquoise line, we integrated two treatment replicates into a unified dataset. For this dataset, we clustered using the first 30 principal components. Cluster-specific markers were identified using the FindMarkers() command (adjusted P value < 0.01). To determine the correspondence between the mTurquoise-positive cluster and clusters within the rice-RNA atlas, we compared the lists of cluster-specific markers (adjusted P value < 0.01, specificity > 2) to those obtained from the rice atlas. For the 10X-multiome (RNA + ATAC) clustering we used Signac⁵⁵. Biological and technical replicates for each species were integrated, and clustering was conducted using the first 50 principal components derived from expression data. After the initial peak calling using Cell Ranger (10X Genomics), peaks were subsequently re-called using MACS2 (ref. ⁵⁶). Differentially accessible peaks between cell types were identified using the FindMarkers() command (adjusted P value < 0.05, per cent threshold > 0.3), before being associated with the nearest gene (±2,000 bp from transcription start site).

Orthology analyses

We determined gene orthologues between rice and sorghum using OrthoFinder⁵⁷. We constructed pan-transcriptome atlases by selecting expressed rice and sorghum genes that had cross-species orthologues. To construct the pan-transcriptome atlas, orthologue conversions were performed in a one-to-one manner, meaning that if multiple orthologues for a gene were found across species, only one was retained. We integrated these datasets with Seurat using the clustering approaches described above. To assign cell identities, we drew on cell-type labels that were previously assigned to each species separately and mapped them onto the pan-transcriptome clusters. To assess specific transcriptional differences in gene expression between the bundle-sheath clusters of sorghum and rice within this dataset, we used the FindMarkers() command (adjusted P value < 0.05). Sorghum DOF transcription factor orthologue names kept the same numerical identifier as their rice orthologues.

To examine the overlap of cell-type-specific gene-expression markers between the two species, we identified cell-type markers from our main transcriptional dataset using FindMarkers() (adjusted P value < 0.05, min.pct > 0.1). We note that some genes were found to be significant across multiple cell types. To assess the significance of the overlap between cell types across species, we converted genes to orthogroups and conducted a Fisher’s exact test, with the total number of orthogroups in the dataset as the background. The proportion of conserved marker genes for each cell type across species ranged from 43% for mesophyll (184 out of 426 rice marker genes conserved in sorghum) to 13% for bundle sheath (31 out of 229 rice marker genes conserved in sorghum). We note that by relying on orthogroups, we included higher-order orthology relationships beyond a one-to-one manner.
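
The overlap test described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: it assumes a one-sided (enrichment) Fisher's exact test, which is equivalent to the upper tail of a hypergeometric distribution, and the function names and toy orthogroup labels are hypothetical.

```python
from math import comb


def overlap_p_value(n_background: int, n_a: int, n_b: int, n_overlap: int) -> float:
    """One-sided Fisher's exact test for enrichment of an overlap.

    Probability of seeing at least n_overlap shared orthogroups when n_a and
    n_b orthogroups are drawn from a background of n_background.
    """
    if n_overlap > min(n_a, n_b) or max(n_a, n_b) > n_background:
        raise ValueError("inconsistent set sizes")
    total = comb(n_background, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_background - n_a, n_b - k)
        for k in range(n_overlap, min(n_a, n_b) + 1)
    )
    return tail / total


def orthogroup_overlap(markers_a, markers_b, background):
    """Return (overlap size, p value) for two marker sets of orthogroups."""
    background = set(background)
    a = set(markers_a) & background
    b = set(markers_b) & background
    shared = a & b
    return len(shared), overlap_p_value(len(background), len(a), len(b), len(shared))


# Toy example with ten hypothetical orthogroups labelled A to J.
n_shared, p = orthogroup_overlap("ABCDE", "ABCDE", "ABCDEFGHIJ")
```

Using the total number of orthogroups as the background, as the text describes, determines how surprising a given overlap is: a larger background makes the same overlap more significant.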

Next, we assessed consistent and differential partitioning of gene-expression patterns among each cell-type pair (15 pairs total). To do this, we first calculated differentially expressed genes for each cell-type pair by pseudo-bulking transcriptomes of individual cell types across 0–12-h time points. Next, we identified partitioned expression patterns between cell types using an ANCOVA model implemented in DESeq2 (adjusted P < 0.05). To perform cross-species comparisons of cell-type pairs, we first converted differentially expressed genes to their orthogroup. We then overlapped each cell-type pair across species, using orthogroup membership, and evaluated the significance of these overlaps using the Fisher’s exact test, with the total number of orthogroups as background. Finally, to distinguish whether a gene displayed consistent or differential partitioning in a particular cell type, we examined whether its fold change expression was higher or lower compared with its counterpart in the corresponding cell type of the other species.

Differential expression and accessibility responses to light

We discovered cell-type-specific differentially expressed genes during the first 12 h of light by pseudo-bulking transcriptional profiles. To create pseudo-bulk profiles for each cell type, we first refined our nuclei clusters through re-clustering mesophyll, epidermal and vasculature cell classes separately, before selecting sub-clusters that most strongly expressed known cell-type marker genes. For each cell type, we calculated the first and second principal component of these bulked profiles and found differentially expressed genes through fitting linear models to each of these principal components, as well as those that responded linearly with time using DESeq2 (adjusted P < 0.05). We treated the assay with which the nuclei were sequenced (10X or sci-RNA-seq3) as a covariate. In this list of differentially expressed genes, we also included genes that were differentially expressed between time points 0 h and 12 h in a pairwise test (adjusted P < 0.05). Next, to uncover the different trends of gene expression among differentially expressed genes, we clustered genes using hierarchical clustering, choosing clustering cut-offs that resulted in 10 rice and 18 sorghum clusters that contained at least 10 genes. To visualize the expression of these clusters, we scaled the expression and fitted a non-linear model to capture the dominant expression trend. Accessible chromatin within canonical photosynthesis genes was found through pseudo-bulking accessible chromatin by cell type. Accessible peaks needed to be within 2,000 bp of the gene body. Only one peak per gene was retained for subsequent analyses, and extreme outliers were removed (around 5% of called peaks). To compare peak accessibility across species, reads per peak were re-normalized between 0 and 1. Significant differences in accessibility between cell types of this group of genes were assessed using a Student’s t-test (one-sided).
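
The rescaling of reads per peak to the 0–1 range, used for the cross-species comparison of accessibility, could look like the sketch below. It assumes simple min-max scaling within each species; the text does not specify the exact normalization, so this choice and the function name are assumptions.

```python
def rescale_unit_interval(values):
    """Min-max rescale a list of per-peak read counts to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All peaks identical: no spread to rescale, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical reads per peak for three peaks in one species.
scaled = rescale_unit_interval([2, 4, 6])
```

Rescaling each species separately removes differences in sequencing depth, so that only the relative accessibility of peaks is compared.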

GO analyses

To identify GO terms associated with cell-type-specific genes and genes that swap expression patterns in rice and sorghum leaves, we performed singular enrichment analysis using the web-based tool AgriGO v.2.0 (ref. ⁵⁸). Oryza sativa or S. bicolor gene identifiers were used for the input sample list, and the whole genome of the respective plant species was used as background.

Cis-element analyses

We detected cell-type-specific accessible motifs within each cell type using the chromVAR function⁵⁹ implemented in Signac. In brief, this approach detected over-represented cis-regulatory elements within the JASPAR2020 plant taxon group⁶⁰ among peaks that are differentially accessible across cell-type clusters. GC enrichment and genomic backgrounds used for statistical tests were derived from BSGenome assembled genomes⁶¹. The same approach was also used to detect light-responsive cis-elements, using light- and dark-treated nuclei within each cell type. We overlapped enriched cis-regulatory elements identified across species by selecting the top 25 most significantly over-represented motifs (adjusted P < 0.05), before computing a Fisher’s exact test using all computed motifs as background, and then clustered the resulting motifs using TOBIAS⁶².

To find consistently and differentially partitioned orthologous genes within our multiome gene-expression dataset, we identified mesophyll- and bundle-sheath-specific genes in rice and sorghum, respectively, using the FindMarkers() command, with a P value cut-off of 0.01 and an expression specificity above 1.25. To find over-represented motifs within differentially partitioned genes, we correlated peak accessibility with gene expression using the LinkPeaks() command and kept only those peaks that were significantly associated with gene expression. We identified enriched cis-elements within these peaks using the FindMotifs() command, ranking by significance (adjusted P < 0.05). Because the resulting significance depends on the subset of the genome chosen as background, we iterated the FindMotifs() command over 100 permutations to rank motifs that were consistently reported as enriched. We then averaged each motif’s respective rank across the 100 permutations to create a final ranked value (Supplementary Table 13).
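
The rank-averaging step across permutations can be sketched as follows. This is a minimal illustration that assumes every permutation returns a complete ordering of the same motifs, most significant first; the function name and the three motif labels are placeholders, and the real analysis used 100 permutations of FindMotifs() in Signac.

```python
def average_motif_ranks(rankings):
    """Average each motif's rank over repeated enrichment runs.

    rankings: list of lists of motif names, each ordered from most to least
    significant. Returns (motif, mean rank) pairs sorted by mean rank.
    """
    if not rankings:
        raise ValueError("at least one ranking is required")
    motifs = set(rankings[0])
    totals = {motif: 0 for motif in motifs}
    for ranking in rankings:
        if len(ranking) != len(motifs) or set(ranking) != motifs:
            raise ValueError("every ranking must contain the same motifs once")
        for position, motif in enumerate(ranking, start=1):
            totals[motif] += position
    n_runs = len(rankings)
    return sorted(
        ((motif, totals[motif] / n_runs) for motif in motifs),
        key=lambda item: (item[1], item[0]),
    )


# Three hypothetical runs with different background samples.
final = average_motif_ranks([
    ["DOF", "MYB", "bZIP"],
    ["MYB", "DOF", "bZIP"],
    ["DOF", "bZIP", "MYB"],
])
```

Averaging ranks rather than P values makes the final ordering less sensitive to any single choice of background regions.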

To quantify the occurrence of DOF-binding sites, we extracted the genomic sequence of peaks that were proximal to the transcription start site (±1,500 bp). If a peak was proximal to two transcription start sites, it was assigned to the closer one. We then implemented Find Individual Motif Occurrences (FIMO) to quantify the number of DOF consensus sites within these chromatin regions (P value threshold = 0.005). We chose the DOF2 (MA0020.1) motif as representative of the core DOF consensus sequence AAAG.
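
The two steps in this paragraph, assigning a peak to the closer transcription start site and counting DOF core sites, can be illustrated with the simplified sketch below. It is not a substitute for FIMO: FIMO scores a position weight matrix against a P value threshold, whereas this example only counts exact matches to the AAAG core on both strands. All function names, gene labels and coordinates are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def assign_peak_to_tss(peak_midpoint, tss_positions, window=1500):
    """Return the gene whose TSS is closest to the peak, or None if no TSS
    lies within the window (in bp)."""
    best_gene, best_distance = None, None
    for gene, tss in tss_positions.items():
        distance = abs(peak_midpoint - tss)
        if distance <= window and (best_distance is None or distance < best_distance):
            best_gene, best_distance = gene, distance
    return best_gene


def count_core_sites(sequence, core="AAAG"):
    """Count exact, possibly overlapping matches to the core on both strands."""
    sequence = sequence.upper()
    reverse_complement = "".join(COMPLEMENT[base] for base in reversed(core))
    # A palindromic core would otherwise be counted twice at the same site.
    targets = {core, reverse_complement}
    width = len(core)
    return sum(
        1
        for start in range(len(sequence) - width + 1)
        if sequence[start:start + width] in targets
    )


gene = assign_peak_to_tss(1000, {"geneA": 400, "geneB": 1800})
n_sites = count_core_sites("AAAGCTTT")
```

Counting on both strands matters because a transcription factor binding site can be read from either orientation of the double-stranded peak sequence.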

We implemented analysis of motif enrichment (AME) to detect DOF transcription factor motifs enriched within C. laxum (http://phytozome-next.jgi.doe.gov/info/Claxum_v1_1), H. vulgare (Hvulgare_r1)⁶³ or B. distachyon (Bdistachyon_314_v3.0)⁶⁴ homologues of genes consistently partitioned to the rice and sorghum bundle sheath. To identify homologues, the NCBI BLASTN tool v.2.15.0 was used by comparing coding sequences, and the top identified homologue for each gene was selected for cis-element enrichment analyses. We used 1,000 bp upstream of the transcription start site for each homologous gene and tested against reported plant motifs present within the JASPAR database.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


November 20, 2024
The emerging view on the origin and early evolution of eukaryotic cells


Article
ADS
CAS
PubMed

Google Scholar
Sousa, F. L., Neukirchen, S., Allen, J. F., Lane, N. & Martin, W. F. Lokiarchaeon is hydrogen dependent. Nat. Microbiol. 1, 16034 (2016).

Article
CAS
PubMed

Google Scholar
Moreira, D. & López-García, P. Symbiosis between methanogenic archaea and δ-proteobacteria as the origin of eukaryotes: the syntrophic hypothesis. J. Mol. Evol. 47, 517–530 (1998).

Article
ADS
CAS
PubMed

Google Scholar
López-García, P. & Moreira, D. The syntrophy hypothesis for the origin of eukaryotes revisited. Nat. Microbiol. 5, 655–667 (2020).

Article
PubMed

Google Scholar
Bulzu, P.-A. et al. Casting light on Asgardarchaeota metabolism in a sunlit microoxic niche. Nat. Microbiol. 4, 1129–1137 (2019).

Article
CAS
PubMed

Google Scholar
Mills, D. B. et al. Eukaryogenesis and oxygen in Earth history. Nat. Ecol. Evol. 6, 520–532 (2022).

Article
PubMed

Google Scholar
Muñoz-Gómez, S. A., Wideman, J. G., Roger, A. J. & Slamovits, C. H. The origin of mitochondrial cristae from Alphaproteobacteria. Mol. Biol. Evol. 34, 943–956 (2017).

PubMed

Google Scholar
Gabaldón, T. & Huynen, M. A. Reconstruction of the proto-mitochondrial metabolism. Science 301, 609–609 (2003).

Article
PubMed

Google Scholar
Gabaldón, T. & Huynen, M. A. From endosymbiont to host-controlled organelle: the hijacking of mitochondrial protein synthesis and metabolism. PLoS Comput. Biol. 3, e219 (2007).

Article
ADS
PubMed
PubMed Central

Google Scholar
Stairs, C. W., Leger, M. M. & Roger, A. J. Diversity and origins of anaerobic metabolism in mitochondria and related organelles. Phil. Trans. R. Soc. B 370, 20140326 (2015).

Article
PubMed
PubMed Central

Google Scholar
Stairs, C. W. et al. Chlamydial contribution to anaerobic metabolism during eukaryotic evolution. Sci. Adv. 6, eabb7258 (2020).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Speijer, D. Alternating terminal electron-acceptors at the basis of symbiogenesis: How oxygen ignited eukaryotic evolution. BioEssays 39, 1600174 (2017).

Article

Google Scholar
Cavalier-Smith, T. The phagotrophic origin of eukaryotes and phylogenetic classification of Protozoa. Int. J. Syst. Evol. Microbiol. 52, 297–354 (2002).

Article
CAS
PubMed

Google Scholar
Martijn, J. & Ettema, T. J. G. From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell. Biochem. Soc. Trans. 41, 451–457 (2013).

Article
CAS
PubMed

Google Scholar
Zachar, I., Szilágyi, A., Számadó, S. & Szathmáry, E. Farming the mitochondrial ancestor as a model of endosymbiotic establishment by natural selection. Proc. Natl Acad. Sci. USA 115, E1504–E1510 (2018).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Baum, D. A. & Baum, B. An inside-out origin for the eukaryotic cell. BMC Biol. 12, 76 (2014).

Article
PubMed
PubMed Central

Google Scholar
Mills, D. B. The origin of phagocytosis in Earth history. Interface Focus 10, 20200019 (2020).

Article
PubMed
PubMed Central

Google Scholar
Bremer, N., Tria, F. D. K., Skejo, J., Garg, S. G. & Martin, W. F. Ancestral state reconstructions trace mitochondria but not phagocytosis to the last eukaryotic common ancestor. Genome Biol. Evol. 14, evac079 (2022).

Article
PubMed
PubMed Central

Google Scholar
Yutin, N., Wolf, M. Y., Wolf, Y. I. & Koonin, E. V. The origins of phagocytosis and eukaryogenesis. Biol. Direct 4, 9 (2009).

Article
PubMed
PubMed Central

Google Scholar
Hugoson, E., Guliaev, A., Ammunét, T. & Guy, L. Host adaptation in Legionellales Is 1.9 Ga, coincident with eukaryogenesis. Mol. Biol. Evol. 39, msac037 (2022).

Article
CAS
PubMed
PubMed Central

Google Scholar
Martin, W. F., Tielens, A. G. M., Mentel, M., Garg, S. G. & Gould, S. B. The physiology of phagocytosis in the context of mitochondrial origin. Microbiol. Mol. Biol. Rev. 81, e00008–e00017 (2017).

Article
CAS
PubMed
PubMed Central

Google Scholar
Hampl, V., Čepička, I. & Eliáš, M. Was the mitochondrion necessary to start eukaryogenesis? Trends Microbiol. 27, 96–104 (2019).

Article
CAS
PubMed

Google Scholar
Shiratori, T., Suzuki, S., Kakizawa, Y. & Ishida, K. Phagocytosis-like cell engulfment by a planctomycete bacterium. Nat. Commun. 10, 5529 (2019).

Article
ADS
PubMed
PubMed Central

Google Scholar
Burns, J. A., Pittis, A. A. & Kim, E. Gene-based predictive models of trophic modes suggest Asgard archaea are not phagocytotic. Nat. Ecol. Evol. 2, 697–704 (2018).

Article
PubMed

Google Scholar
Cavalier-Smith, T. Archaebacteria and archezoa. Nature 339, 100–101 (1989).

Article
ADS

Google Scholar
Embley, T. M. & Hirt, R. P. Early branching eukaryotes? Curr. Opin. Genet. Dev. 8, 624–629 (1998).

Article
CAS
PubMed

Google Scholar
Ettema, T. J. G. Evolution: mitochondria in the second act. Nature 531, 39–40 (2016).

Article
ADS
CAS
PubMed

Google Scholar
Lane, N. & Martin, W. The energetics of genome complexity. Nature 467, 929–934 (2010).

Article
ADS
CAS
PubMed

Google Scholar
Lane, N. Energetics and genetics across the prokaryote–eukaryote divide. Biol. Direct 6, 35 (2011).

Article
PubMed
PubMed Central

Google Scholar
Booth, A. & Doolittle, W. F. Eukaryogenesis, how special really? Proc. Natl Acad. Sci. USA 112, 10278–10285 (2015).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Lynch, M. & Marinov, G. K. The bioenergetic costs of a gene. Proc. Natl Acad. Sci. USA 112, 15690–15695 (2015).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Koonin, E. V. Energetics and population genetics at the root of eukaryotic cellular and genomic complexity. Proc. Natl Acad. Sci. USA 112, 15777–15778 (2015).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Lynch, M. & Marinov, G. K. Membranes, energetics, and evolution across the prokaryote–eukaryote divide. eLife 6, e20437 (2017).

Article
PubMed
PubMed Central

Google Scholar
Lane, N. Serial endosymbiosis or singular event at the origin of eukaryotes? J. Theor. Biol. 434, 58–67 (2017).

Article
ADS
PubMed

Google Scholar
Chiyomaru, K. & Takemoto, K. Revisiting the hypothesis of an energetic barrier to genome complexity between eukaryotes and prokaryotes. R. Soc. Open Sci. 7, 191859 (2020).

Article
ADS
PubMed
PubMed Central

Google Scholar
Lane, N. How energy flow shapes cell evolution. Curr. Biol. 30, R471–R476 (2020).

Article
CAS
PubMed

Google Scholar
Schavemaker, P. E. & Muñoz-Gómez, S. A. The role of mitochondrial energetics in the origin and diversification of eukaryotes. Nat. Ecol. Evol. 6, 1307–1317 (2022).

Article
PubMed
PubMed Central

Google Scholar
Volland, J.-M. et al. A centimeter-long bacterium with DNA contained in metabolically active, membrane-bound organelles. Science 376, 1453–1458 (2022).

Article
ADS
CAS
PubMed

Google Scholar
Greening, C. & Lithgow, T. Formation and function of bacterial organelles. Nat. Rev. Microbiol. 18, 677–689 (2020).

Article
CAS
PubMed

Google Scholar
Küper, U., Meyer, C., Müller, V., Rachel, R. & Huber, H. Energized outer membrane and spatial separation of metabolic processes in the hyperthermophilic Archaeon Ignicoccus hospitalis. Proc. Natl Acad. Sci. USA 107, 3152–3156 (2010).

Article
ADS
PubMed
PubMed Central

Google Scholar
Wiegand, S., Jogler, M. & Jogler, C. On the maverick planctomycetes. FEMS Microbiol. Rev. 42, 739–760 (2018).

Article
CAS
PubMed

Google Scholar
Katayama, T. et al. Isolation of a member of the candidate phylum ‘Atribacteria’ reveals a unique cell membrane structure. Nat. Commun. 11, 6381 (2020).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Pittis, A. A. & Gabaldón, T. Late acquisition of mitochondria by a host with chimaeric prokaryotic ancestry. Nature 531, 101–104 (2016). This study presented a novel approach to use phylogenetic branch lengths to infer the relative timing of gene acquisitions during eukaryogenesis, pointing to rampant bacterial gene flow to stem eukaryotes prior to the proto-mitochondrial acquisition.

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Gabaldón, T. Relative timing of mitochondrial endosymbiosis and the “pre-mitochondrial symbioses” hypothesis. IUBMB Life 70, 1188–1196 (2018).

Article
PubMed
PubMed Central

Google Scholar
Vosseberg, J., Schinkel, M., Gremmen, S. & Snel, B. The spread of the first introns in proto-eukaryotic paralogs. Commun. Biol. 5, 476 (2022).

Article
CAS
PubMed
PubMed Central

Google Scholar
Susko, E., Steel, M. & Roger, A. J. Conditions under which distributions of edge length ratios on phylogenetic trees can be used to order evolutionary events. J. Theor. Biol. 526, 110788 (2021).

Article
MathSciNet
CAS
PubMed

Google Scholar
Tricou, T., Tannier, E. & de Vienne, D. M. Ghost lineages can invalidate or even reverse findings regarding gene flow. PLoS Biol. 20, e3001776 (2022).

Article
CAS
PubMed
PubMed Central

Google Scholar
Fritz-Laylin, L. K. et al. The genome of Naegleria gruberi illuminates early eukaryotic versatility. Cell 140, 631–642 (2010).

Article
CAS
PubMed

Google Scholar
Huynen, M. A., Duarte, I. & Szklarczyk, R. Loss, replacement and gain of proteins at the origin of the mitochondria. Biochim. Biophys. Acta 1827, 224–231 (2013).

Article
CAS
PubMed

Google Scholar
Timmis, J. N., Ayliffe, M. A., Huang, C. Y. & Martin, W. Endosymbiotic gene transfer: organelle genomes forge eukaryotic chromosomes. Nat. Rev. Genet. 5, 123–135 (2004).

Article
CAS
PubMed

Google Scholar
Karnkowska, A. et al. A eukaryote without a mitochondrial organelle. Curr. Biol. 26, 1274–1284 (2016).

Article
CAS
PubMed

Google Scholar
Gabaldón, T. et al. Origin and evolution of the peroxisomal proteome. Biol. Direct 1, 8 (2006).

Article
PubMed
PubMed Central

Google Scholar
Rochette, N. C., Brochier-Armanet, C. & Gouy, M. Phylogenomic test of the hypotheses for the evolutionary origin of eukaryotes. Mol. Biol. Evol. 31, 832–845 (2014).

Article
CAS
PubMed
PubMed Central

Google Scholar
Irwin, N. A. T., Pittis, A. A., Richards, T. A. & Keeling, P. J. Systematic evaluation of horizontal gene transfer between eukaryotes and viruses. Nat. Microbiol. 7, 327–336 (2022).

Article
CAS
PubMed

Google Scholar
Ku, C. et al. Endosymbiotic origin and differential loss of eukaryotic genes. Nature 524, 427–432 (2015).

Article
ADS
CAS
PubMed

Google Scholar
Gould, S. B., Garg, S. G. & Martin, W. F. Bacterial vesicle secretion and the evolutionary origin of the eukaryotic endomembrane system. Trends Microbiol. 24, 525–534 (2016).

Article
CAS
PubMed

Google Scholar
Coleman, G. A., Pancost, R. D. & Williams, T. A. Investigating the origins of membrane phospholipid biosynthesis genes using outgroup-free rooting. Genome Biol. Evol. 11, 883–898 (2019).

Article
CAS
PubMed
PubMed Central

Google Scholar
Volker, C. & Lupas, A. N. in The Proteasome–Ubiquitin Protein Degradation Pathway (eds Zwickl, P. & Baumeister, W.) 1–22 (Springer, 2002).
Vosseberg, J., Stolker, D., von der Dunk, S. H. A. & Snel, B. Integrating phylogenetics with intron positions illuminates the origin of the complex spliceosome. Mol. Biol. Evol. 40, msad011 (2023).

Article
CAS
PubMed
PubMed Central

Google Scholar
Tromer, E. C., Hooff, J. J. E., van, Kops, G. J. P. L. & Snel, B. Mosaic origin of the eukaryotic kinetochore. Proc. Natl Acad. Sci. USA 116, 12873–12882 (2019).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Findeisen, P. et al. Six subgroups and extensive recent duplications characterize the evolution of the eukaryotic tubulin protein family. Genome Biol. Evol. 6, 2274–2288 (2014).

Article
CAS
PubMed
PubMed Central

Google Scholar
Muñoz-Gómez, S. A., Bilolikar, G., Wideman, J. G. & Geiler-Samerotte, K. Constructive neutral evolution 20 years later. J. Mol. Evol. 89, 172–182 (2021).

Article
ADS
PubMed
PubMed Central

Google Scholar
Dacks, J. B. & Field, M. C. Evolution of the eukaryotic membrane-trafficking system: origin, tempo and mode. J. Cell Sci. 120, 2977–2985 (2007).

Article
CAS
PubMed

Google Scholar
Dacks, J. B. & Field, M. C. Evolutionary origins and specialisation of membrane transport. Curr. Opin. Cell Biol. 53, 70–76 (2018).

Article
CAS
PubMed
PubMed Central

Google Scholar
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Ekman, D., Björklund, Å. K., Frey-Skött, J. & Elofsson, A. Multi-domain proteins in the three kingdoms of life: orphan domains and other unassigned regions. J. Mol. Biol. 348, 231–243 (2005).

Article
CAS
PubMed

Google Scholar
Liu, J. & Rost, B. Comparing function and structure between entire proteomes. Protein Sci. 10, 1970–1979 (2001).

Article
CAS
PubMed
PubMed Central

Google Scholar
Xue, B., Dunker, A. K. & Uversky, V. N. Orderly order in protein intrinsic disorder distribution: disorder in 3500 proteomes from viruses and the three domains of life. J. Biomol. Struct. Dyn. 30, 137–149 (2012).

Article
CAS
PubMed

Google Scholar
Colnaghi, M., Lane, N. & Pomiankowski, A. Genome expansion in early eukaryotes drove the transition from lateral gene transfer to meiotic sex. eLife 9, e58873 (2020).

Article
CAS
PubMed
PubMed Central

Google Scholar
van Dijk, B., Bertels, F., Stolk, L., Takeuchi, N. & Rainey, P. B. Transposable elements promote the evolution of genome streamlining. Phil. Trans. R. Soc. B 377, 20200477 (2022).

Article
PubMed

Google Scholar
Colnaghi, M., Lane, N. & Pomiankowski, A. Repeat sequences limit the effectiveness of lateral gene transfer and favored the evolution of meiotic sex in early eukaryotes. Proc. Natl Acad. Sci. USA 119, e2205041119 (2022).

Article
CAS
PubMed
PubMed Central

Google Scholar
Gilbert, W. Why genes in pieces? Nature 271, 501–501 (1978).

Article
ADS
CAS
PubMed

Google Scholar
Liu, M. & Grigoriev, A. Protein domains correlate strongly with exons in multiple eukaryotic genomes – evidence of exon shuffling? Trends Genet. 20, 399–403 (2004).

Article
PubMed

Google Scholar
Grau-Bové, X. et al. Dynamics of genomic innovation in the unicellular ancestry of animals. eLife 6, e26036 (2017).

Article
PubMed
PubMed Central

Google Scholar
Ocaña-Pallarès, E. et al. Divergent genomic trajectories predate the origin of animals and fungi. Nature 609, 747–753 (2022).

Article
ADS
PubMed
PubMed Central

Google Scholar
Méheust, R. et al. Formation of chimeric genes with essential functions at the origin of eukaryotes. BMC Biol. 16, 30 (2018).

Article
PubMed
PubMed Central

Google Scholar
Tamarit, D. et al. Description of Asgardarchaeum abyssi gen. nov. spec. nov., a novel species within the class Asgardarchaeia and phylum Asgardarchaeota in accordance with the SeqCode. Syst. Appl. Microbiol. 47, 126525 (2024).
Delsuc, F., Brinkmann, H. & Philippe, H. Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 6, 361–375 (2005).

Article
CAS
PubMed

Google Scholar
Kapli, P., Yang, Z. & Telford, M. J. Phylogenetic tree building in the genomic age. Nat. Rev. Genet. 21, 428–444 (2020).

Article
CAS
PubMed

Google Scholar
Steenwyk, J. L., Li, Y., Zhou, X., Shen, X.-X. & Rokas, A. Incongruence in the phylogenomics era. Nat. Rev. Genet. 24, 834–850 (2023).

Article
CAS
PubMed

Google Scholar
Fleming, J. F., Valero-Gracia, A. & Struck, T. H. Identifying and addressing methodological incongruence in phylogenomics: a review. Evol. Appl. 16, 1087–1104 (2023).

Article
CAS
PubMed
PubMed Central

Google Scholar
Foster, P. G. et al. Recoding amino acids to a reduced alphabet may increase or decrease phylogenetic accuracy. Syst. Biol. 72, 723–737 (2023).

Article
CAS
PubMed

Google Scholar
Susko, E. & Roger, A. J. On reduced amino acid alphabets for phylogenetic inference. Mol. Biol. Evol. 24, 2139–2150 (2007).

Article
CAS
PubMed

Google Scholar
Viklund, J., Ettema, T. J. G. & Andersson, S. G. E. Independent genome reduction and phylogenetic reclassification of the oceanic SAR11 clade. Mol. Biol. Evol. 29, 599–615 (2012).

Article
CAS
PubMed

Google Scholar

Recurrent evolution and selection shape structural diversity at the amylase locus

Amylase gene naming conventions

The reference genome GRCh38 represents an H3 haplotype with three copies of the AMY1 gene and one copy each of the AMY2A and AMY2B genes. The three AMY1 copies are labelled AMY1A, AMY1B and AMY1C because the HUGO naming convention requires all gene copies to have unique names. However, these various copies of AMY1 genes across different haplotypes are recent duplications that share high sequence similarity, and therefore are referred to simply as AMY1 genes in this paper and others²,³,¹⁵. By contrast, AMY2A and AMY2B stem from a much older gene duplication event and are much more diverged than the different copies of AMY1 genes¹³. They share the AMY2 prefix simply because they are both expressed in the pancreas.

Datasets

Short-read sequencing data were compiled from high-coverage resequencing of the 1,000 Genomes Project (1KG) samples¹⁹, the Simons Genome Diversity Panel (SGDP)²⁰, and the Human Genome Diversity Panel (HGDP)¹⁸. Genomes from GTEx²¹ samples were also assessed, but only for gene expression analyses as the ancestry of these samples was not available. In total, we obtained copy number genotype estimates for 5,130 contemporary samples. Among these, 838 are GTEx samples, 698 are trios from the 1KG, and the rest (n = 3,594, that is, 7,188 haplotypes) are unrelated individual samples compiled from the 1KG, HGDP and SGDP. GTEx and 1KG trio samples were excluded from analyses characterizing the global diversity of the amylase locus. We performed haplotype deconvolutions on all unrelated samples as well as trio data (n = 4,292 total), but the trios were only used for validation purposes.

Supplementary Fig. 25 shows structural variant calls from the gnomAD project⁴⁶. Phased SNP calls from 1KG and HGDP samples were compiled from Koenig et al.⁴⁷, which includes all of our 1KG and HGDP samples but only some of the SGDP samples (n = 3,395 total). These data were used for the analyses of linkage disequilibrium, nucleotide diversity, principal component analysis (PCA) and selection scans⁴⁷.

Ancient genome short-read fastq samples were compiled from Allentoft et al.³⁰ and Marchi et al.²⁹ and were mapped to the human reference genome GRCh38 with BWA (v0.7.17; ‘bwa mem’)⁴⁸. The modern genomes and the 14 Marchi et al. genomes are of high coverage and quality; however, the Allentoft et al. samples were of varying quality and coverage³⁰. The Allentoft et al. dataset included more than 1,600 ancient genomes including 317 newly sequenced ancient individuals alongside 1,492 previously published genomes. Unfortunately, many published ancient genomes have been filtered to exclude multi-mapped reads leaving large gaps over regions such as the amylase locus. After removing genomes with missing data, 690 samples remained. We carefully analysed these 690 genomes to determine their quality by quantifying the standard deviation of genome-wide copy number (after removing the top and bottom fifth percentiles of copy number to exclude outliers). We chose a standard deviation cut-off of 0.49 based on a visual inspection of the copy number data and selected 519 samples (approximately 75% of 690) with sufficient read depth for copy number genotyping. Ancient samples were assigned to one of eight major ancient populations in West Eurasia based on their genetic ancestry, location and age obtained from their original publications²⁹,³⁰,⁴⁹,⁵⁰ (Fig. 5a, Supplementary Table 8 and Supplementary Fig. 19). These populations include: Eastern hunter-gatherer, Caucasian hunter-gatherer, Western hunter-gatherer, early farmer (samples with primarily Anatolian farmer ancestry), Neolithic farmer (samples with mixed Anatolian farmer and Western hunter-gatherer ancestry), Steppe pastoralist (samples with mixed Eastern hunter-gatherer and Caucasian hunter-gatherer ancestry), Bronze Age (samples with mixed Neolithic farmer and Steppe ancestry), and Iron Age to early modern. Finally, four archaic genomes were assessed including three high-coverage Neanderthal genomes and the high-coverage Denisova genome²⁷,⁵¹,⁵²,⁵³.
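
The quality filter applied to the ancient genomes above can be illustrated with a minimal sketch. It assumes that "removing the top and bottom fifth percentiles" means discarding the lowest and highest 5% of windowed copy number values before computing the standard deviation, and that a sample passes when the result is at or below the 0.49 cut-off; the function names and the inclusive comparison are illustrative choices, not taken from the original analysis.

```python
import statistics


def trimmed_sd(copy_numbers, trim_fraction=0.05):
    """Standard deviation after dropping the lowest and highest
    `trim_fraction` of values (assumed reading of the outlier removal)."""
    values = sorted(copy_numbers)
    n_trim = int(len(values) * trim_fraction)
    kept = values[n_trim:len(values) - n_trim] if n_trim else values
    return statistics.pstdev(kept)


def passes_qc(copy_numbers, sd_cutoff=0.49):
    """True when the trimmed standard deviation does not exceed the cut-off.
    The cut-off value comes from the text; the inclusive test is an assumption."""
    return trimmed_sd(copy_numbers) <= sd_cutoff


# Hypothetical genome-wide copy number estimates for two samples.
well_covered = [2.0] * 90 + [1.9] * 5 + [2.1] * 5
noisy = [1.0, 3.0] * 50

print(passes_qc(well_covered), passes_qc(noisy))
```

Trimming before computing the spread keeps a handful of genuinely copy-number-variable windows from dominating the estimate, so the statistic mostly reflects read-depth noise.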

Long-read haplotype assemblies were compiled from the HPRC²³. Year 1 genome assembly freeze data were compiled along with year 2 test assemblies. Haplotype assemblies were included in our analyses only if they spanned the amylase SVR. Furthermore, in cases in which both haplotypes of an individual spanned the SVR, we checked to ensure that the diploid copy number of amylase genes matched with the read-depth-based estimate of copy number. We noted that several year 1 assemblies (which were not assembled using ONT ultralong sequencing data) appeared to have been misassembled across the amylase locus, as they were either discontiguous across the SVR or had diploid assembly copy numbers that did not match with short-read-predicted copy number. We thus reassembled these genomes incorporating ONT ultralong sequence using the Verkko assembler (v1.3.1)⁵⁴, constructing improved assemblies for HG00673, HG01106, HG01361, HG01175, HG02148 and HG02257. Alongside these HPRC genome assemblies, we included GRCh38 and the newly sequenced T2T-CHM13 reference²⁴.

Determination of subsistence by population

The diets of several populations (see Supplementary Table 2) were determined from the literature²,⁵⁵,⁵⁶,⁵⁷,⁵⁸,⁵⁹,⁶⁰,⁶¹,⁶²,⁶³. We were able to identify the traditional diets for 33 populations. All other populations were excluded from this analysis.

Read-depth-based copy number genotyping

Copy number genotypes were estimated using read depth as described in ref. ¹⁶. In brief, read depth was quantified from BAMs in 1,000-bp sliding windows in 200-bp steps across the genome. These depths were then normalized to a control region in which no evidence of copy number variation was observed in more than 4,000 individuals. Depth-based ‘raw’ estimates of copy number were then calculated by averaging these estimates over regions of interest. Regions used for genotyping are found in Supplementary Table 10. We note that the AMY2Ap pseudogene is a partial duplication of AMY2A that excludes the approximately 4,500 bp of the 5′ end of the gene. This region can thus be used to genotype the AMY2A copy without ‘double counting’ AMY2Ap gene duplicates. Copy number genotype likelihoods were estimated by fitting a modified Gaussian mixture model to raw copy estimates across all individuals with the following parameters: k, the number of mixture components, set to be the difference between the highest and lowest integer-value copy numbers observed; π, a k-dimensional vector of mixture weights; σ, a single-variance term for mixture components; and o, an offset term by which the means of all mixture components are shifted. The difference between mixture component means was fixed at 1, and the model was fit using expectation maximization (Supplementary Fig. 1). The copy number maximizing the likelihood function was used as the estimated copy number for each individual in subsequent analyses. Comparing these maximum likelihood copy number estimates with droplet digital PCR yielded very high concordance with r² = 0.98, 0.99 and 0.96 for AMY1, AMY2A and AMY2B, respectively (Supplementary Fig. 1). For comparisons of copy number as a function of sustenance, populations were downsampled to a maximum of 50 individuals. 
We also used a linear mixed effects model approach in which all samples were maintained, which provided similar results (P = 0.013, P = 0.058 and P = 0.684 for AMY1, AMY2A and AMY2B, respectively).
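
To make the constrained mixture model concrete, the following is a minimal sketch of an expectation-maximization fit with component means fixed one copy apart, a shared σ and a common offset o. It is not the authors' implementation: the function name, the starting values, the σ floor and the choice to place one component on every integer between the lowest and highest rounded raw estimate are all assumptions made for illustration.

```python
import math


def fit_copy_number_mixture(raw, n_iter=200):
    """EM for a Gaussian mixture whose means are (integer copy number + offset),
    with one shared standard deviation across components."""
    lo, hi = round(min(raw)), round(max(raw))
    states = list(range(lo, hi + 1))           # candidate integer copy numbers
    weights = [1.0 / len(states)] * len(states)
    sigma, offset = 0.25, 0.0                  # illustrative starting values
    n = len(raw)
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each copy number for each sample.
        # The Gaussian normalizing constant cancels because sigma is shared.
        resp = []
        for x in raw:
            dens = [w * math.exp(-0.5 * ((x - c - offset) / sigma) ** 2)
                    for w, c in zip(weights, states)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: closed-form updates for weights, offset and shared variance.
        weights = [sum(r[j] for r in resp) / n for j in range(len(states))]
        offset = sum(r[j] * (x - c)
                     for x, r in zip(raw, resp)
                     for j, c in enumerate(states)) / n
        var = sum(r[j] * (x - c - offset) ** 2
                  for x, r in zip(raw, resp)
                  for j, c in enumerate(states)) / n
        sigma = max(math.sqrt(var), 0.05)      # floor avoids numerical collapse
    # Assign each sample the copy number with the highest posterior.
    genotypes = [states[max(range(len(states)), key=lambda j: r[j])]
                 for r in resp]
    return genotypes, weights, sigma, offset


# Hypothetical raw depth-based estimates with a small systematic shift.
raw_estimates = [2.1, 1.9, 2.05, 3.1, 2.95, 4.1, 4.0, 2.0, 3.05, 6.1]
genotypes, weights, sigma, offset = fit_copy_number_mixture(raw_estimates)
print(genotypes)
```

Tying the means to a unit grid leaves only the weights, one σ and one offset to estimate, which keeps rarely observed high-copy components from drifting away from integer spacing.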

Analysis of gene expression

Gene expression data from the GTEx project²¹ were downloaded alongside short-read data (see above section). Normalized gene expression values for AMY2A and AMY2B were compared with copy number estimates using linear regression (Extended Data Fig. 2).

MAP-graph construction

Regions overlapping the amylase locus were extracted from genome assemblies in two different ways. First, we constructed a PanGenome Research Tool Kit (PGR-TK) database from the HPRC year 1 genome assemblies and used the default parameters of w = 80, k = 56, r = 4 and min-span = 64 for building the sequence database index. The GRCh38 region chromosome 1: 103655518–103664551 was then used to identify corresponding AMY1/AMY2A/AMY2B regions across these individuals. Additional assemblies were subsequently added to our analysis by using minimap2 (ref. ⁶⁴) to extract the amylase locus from those genome assemblies. The MAP-graph and the principal bundles were generated using PGR-TK revision v0.4.0 (git commit hash: ed55d6a8). The Python scripts and the parameters used for generating the principal bundle decomposition can be found in the associated GitHub repository. The position of genes along haplotypes was determined by mapping gene models to haplotypes using minimap2 (ref. ⁶⁴).

Analysis of mutations at amylase genes

To identify mutations in amylase genes from long-read assemblies and evaluate their functional impact, we first aligned all amylase gene sequences to AMY1A, AMY2A and AMY2B sequences on GRCh38 using minimap2 (ref. ⁶⁴). We then used paftools.js⁶⁴ for variant calling, and vep-v.105.0 (ref. ⁶⁵) for variant effect prediction.

PGGB-based graph construction

Although the existing pangenome graphs from the HPRC provide a valuable resource, we discovered that they did not provide the best reference system for genotyping copy number variation. Our validation of the genotyping approach revealed that we would experience high genotyping error when gene copies (for example, all copies of AMY1 or all copies of AMY2B) were not fully ‘collapsed’ into a single region in the graph. We thus elected to rebuild the graph locally to improve genotyping accuracy for complex structural variants. This achieves substantially improved results by allowing multiple mappings of each haplotype against others, which leads to a graph in which multi-copy genes are collapsed into single regions of the graph. This collapsed representation is important for graph-based genotyping. In addition, we incorporated additional samples, some of which were reassembled by us, that were not part of the original dataset from the HPRC to have a more comprehensive representation of variability in the amylase locus, which required rebuilding the pangenome graph model at the amylase locus.

A PGGB graph was constructed from 94 haplotypes spanning the amylase locus using PGGB (v0.5.4; commit 736c50d8e32455cc25db19d119141903f2613a63)²⁵ with the following parameters: ‘-n 94’ (the number of haplotypes in the graph to be built) and ‘-c 2’ (the number of mappings for each sequence segment). The latter parameter allowed us to build a graph that correctly represents the high copy number variation in such a locus. We used ODGI (v0.8.3; commit de70fcdacb3fc06fd1d8c8d43c057a47fac0310b)⁶⁶ to produce a Jaccard distance-based (that is, 1 − Jaccard similarity coefficient) dissimilarity matrix of paths in our variation graph (‘odgi similarity -d’). These pre-computed distances were used to construct a tree of relationships between haplotype structures using neighbour joining.
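
As an illustration of the dissimilarity measure named above, the sketch below computes 1 − Jaccard similarity between haplotype paths represented as sets of graph node identifiers. This is a simplification for exposition only: it ignores node length and how many times a path traverses a node, so it should not be read as a description of how ‘odgi similarity’ computes its values, and the path names and node identifiers are hypothetical.

```python
from itertools import combinations


def jaccard_distance(path_a, path_b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of node identifiers."""
    a, b = set(path_a), set(path_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(paths):
    """Symmetric dictionary of pairwise Jaccard distances between named paths."""
    names = sorted(paths)
    dist = {(name, name): 0.0 for name in names}
    for x, y in combinations(names, 2):
        d = jaccard_distance(paths[x], paths[y])
        dist[(x, y)] = dist[(y, x)] = d
    return dist


# Hypothetical haplotype paths through a small graph.
toy_paths = {
    "hapA": [1, 2, 3, 4],
    "hapB": [1, 2, 3, 5],
    "hapC": [1, 6],
}
for pair, d in sorted(distance_matrix(toy_paths).items()):
    print(pair, round(d, 3))
```

A matrix of this form is the input that a neighbour-joining routine would take to build the tree of haplotype structures.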

Haplotype deconvolution approach

We implemented a pipeline based on the workflow language Snakemake (v7.32.3) to parallelize haplotype deconvolution (that is, assign to a short-read-sequenced individual the haplotype pair in a pangenome that best represents its genotype at a given locus) in thousands of samples.

Given a region-specific PGGB graph (gfa; see ‘PGGB-based graph construction’), a list of short-read alignments (BAM/CRAM), a reference build (fasta) and a corresponding region of interest (chr: start–end; based on the alignment of the BAM/CRAM), our pipeline ran as follows:

1. Extracted the haplotypes from the initial pangenome using ODGI (v0.8.3; ‘odgi paths -f’)⁶⁶.
2. For each short-read sample, extracted all the reads spanning the region of interest using SAMTOOLS (v1.18; ‘samtools fasta’)⁶⁷.
3. Mapped the extracted reads back to the haplotypes with BWA (v0.7.17; ‘bwa mem’)⁴⁸. To map ancient samples, we used ‘bwa aln’ with parameters suggested in Oliva et al.⁶⁸ instead: ‘bwa aln -l 1024 -n 0.01 -o 2’.
4. Computed a node depth matrix for all the haplotypes in the pangenome; every time a certain haplotype in the pangenome loops over a node, the path depth for that haplotype over that node increases by one. This was done using a combination of commands in ODGI (‘odgi chop -c 32’ and ‘odgi paths -H’).
5. Computed a node depth vector for each short-read sample; short-read alignments were mapped to the pangenome using GFAINJECT (https://github.com/ekg/gfainject; commit f5feb7b) and their coverage over nodes was computed using GAFPACK (https://github.com/ekg/gafpack; commit ad31875).
6. Compared each short-read vector (see step 5) with each possible pair of haplotype vectors (see step 4) by means of cosine similarity using COSIGT (https://github.com/davidebolo1993/cosigt; commit e247261), which measures the similarity between two vectors as their dot product divided by the product of their lengths. The haplotype pair having the highest similarity with the short-read vector was used to describe the genotype of the sample.
7. The final genotypes were assigned as the corresponding consensus structures of the highest similarity pair of haplotypes.
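
The comparison in step 6 can be sketched as follows. This is an illustrative reimplementation, not the code in the repository cited above: it assumes that the expected diploid profile for a haplotype pair is the element-wise sum of the two haplotype node depth vectors, that homozygous pairs are allowed, and the haplotype names and depth values are invented for the example.

```python
import math
from itertools import combinations_with_replacement


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_haplotype_pair(sample_depth, haplotype_depths):
    """Return ((hap1, hap2), score) for the pair whose summed node depth
    vector is most similar to the sample's node coverage vector."""
    best = None
    names = sorted(haplotype_depths)
    for h1, h2 in combinations_with_replacement(names, 2):
        diploid = [a + b for a, b in
                   zip(haplotype_depths[h1], haplotype_depths[h2])]
        score = cosine_similarity(sample_depth, diploid)
        if best is None or score > best[1]:
            best = ((h1, h2), score)
    return best


# Hypothetical node depth vectors: each entry counts how many times a
# haplotype traverses one graph node.
haplotypes = {
    "H1": [1, 1, 1, 0],
    "H2": [1, 3, 1, 0],
    "H3": [1, 1, 1, 2],
}
# Hypothetical short-read coverage over the same four nodes.
sample = [31, 58, 29, 32]

pair, score = best_haplotype_pair(sample, haplotypes)
print(pair, round(score, 3))
```

Because cosine similarity depends only on the direction of the vectors, the comparison is insensitive to overall sequencing depth, which is why raw coverage can be compared directly with integer path depths.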

Our pipeline is publicly available on GitHub (https://github.com/raveancic/graph_genotyper) and is archived in Zenodo (https://zenodo.org/doi/10.5281/zenodo.10843493).

We assessed the accuracy of the haplotype deconvolution approach in several different ways. First, we assessed 35 individuals (70 haplotypes) for which both short-read sequencing data and long-read diploid assemblies were available. In 100% of cases (70 of 70 haplotypes), we accurately distinguished the correct haplotypes present in an individual from short-read sequencing data. We further assessed how missing haplotypes in the pangenome graph might assess the accuracy of our approach by performing a ‘leave-one-out, jackknifing’ analysis. In this approach, for each of the 35 long-read individuals, we rebuilt the variation graph with a single haplotype excluded and tested our ability to identify the correct consensus haplotype from the remaining haplotypes. The true positive rate was approximately 93% in this case. Second, we compared our haplotype deconvolutions to haplotypes determined by inheritance patterns in 44 families in a previous study¹⁵ (Supplementary Table 3). We note that this study hypothesized the existence of an H4A4B4 haplotype without having observed it directly. In our study, we also found no direct evidence of the H4A4B4 haplotype. Furthermore, we found that inheritance patterns are equally well explained by other directly observed haplotypes and thus exclude these predictions from our comparisons (two individuals excluded). We identified the exact same pair of haplotypes in 95% of individuals (125 of 131 individuals), and in 97% of individuals (288 of 298 individuals), the haplotype pair that we identified is among the potential consistent haplotype pairs identified from inheritance. Third, we compared inheritance patterns in 602 diverse short-read-sequenced trios from the 1KG populations¹⁹. For each family, we randomly selected one parent and assessed whether either of the two offspring haplotypes were present in this randomly selected parent. 
Across all families, this proportion, p, represents an estimate of the proportion of genotype calls that are accurate in both the offspring and that parent, thus the single sample accuracy can be estimated as the square root of p. From these analyses, we identified 533 of 602 parent–offspring genotype calls that are correct, corresponding to an estimated accuracy of 94%. Fourth, we compared our previously estimated reference genome read-depth-based copy number genotypes to those predicted from haplotype deconvolutions across 4,292 diverse individuals. These genotypes exhibited 95–99% concordance across different amylase genes (95%, 97% and 99% for AMY1, AMY2A and AMY2B, respectively). Cases in which the two estimates differed were generally high-copy genotypes for which representative haplotype assemblies have not yet been observed and integrated into the graph (Extended Data Fig. 7a). Overall we thus estimated the haplotype deconvolution approach to be approximately 95% accurate for modern samples, and thus choose not to propagate the remaining 5% uncertainty into downstream analyses.
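The square-root estimate used for the trio comparison can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function name is hypothetical and the counts are the ones reported above.

```python
import math


def single_sample_accuracy(concordant: int, total: int) -> float:
    """Estimate per-sample accuracy as sqrt(p).

    p is the fraction of parent-offspring comparisons in which a haplotype
    is shared. Because a comparison is only correct when both calls are
    correct, p is treated as accuracy squared.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = concordant / total
    return math.sqrt(p)


# Counts reported in the text: 533 of 602 parent-offspring comparisons.
accuracy = single_sample_accuracy(533, 602)
print(f"p = {533 / 602:.3f}, estimated accuracy = {accuracy:.3f}")
```

This treats errors in parent and offspring as independent, which is the assumption implicit in taking the square root.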

To determine the impact of coverage and technical artefacts common in ancient DNA, we performed simulations. We selected 40 individuals having both haplotypes represented in the AMY graph and, for those, we simulated short reads mirroring error profiles in modern and ancient genomes across different coverage levels. More specifically, we simulated paired-end short reads for the modern samples with wgsim (https://github.com/lh3/wgsim; commit a12da33, ‘wgsim -1 150 -2 150’) and single-end short reads for the ancient samples with NGSNGS⁶⁹ (commit 559d552, ‘ngsngs -ne -lf Size_dist_sampling.txt -seq SE -m b7,0.024,0.36,0.68,0.0097 -q1 AccFreqL150R1.txt’ following the suggestions by the author in https://github.com/RAHenriksen/NGSNGS). Synthetic reads were then aligned against the GRCh38 build of the human reference genome using bwa-mem2 (ref. ⁷⁰; commit 7f3a4db). For samples modelling modern individuals, we generated 5–30X coverage data, whereas for those modelling ancient genomes, we aimed for lower coverage (1–10X) to better approximate true-to-life data. We ran our haplotype deconvolution pipeline independently for modern and ancient simulated samples, as well as varying coverage levels. Out of 480 tests, only 9 (approximately 1%) yielded incorrect predictions, exclusively in ancient simulated sequences, with coverage ranging from 1X to 4X. Cosine similarity scores for ancient simulated sequences ranged from 0.789 to 0.977 (median of 0.950), whereas scores for modern simulated sequences ranged from 0.917 to 0.992 (median of 0.981; Extended Data Fig. 7b). We therefore conclude that the haplotype deconvolution method is also highly accurate for ancient samples. Out of an abundance of caution, we further imposed a conservative quality score threshold of 0.75 on ancient samples, resulting in 288 ancient samples with high-confidence haplotype assignment out of a total of 533 (Supplementary Figs. 20 and 21). 
We note that the haplotype deconvolutions in ancient samples are probably more accurate than read-depth genotypes, which tend to be biased towards higher copy number.

Linkage disequilibrium estimation

To investigate pairwise linkage disequilibrium across the SVR region at a global scale, we first merged our copy number estimates with the joint SNP call set from the HGDP and 1KG⁴⁷, resulting in a variant call set of 3,395 diverse individuals with both diploid copy number genotypes and phased SNP calls. In brief, we used bcftools (v1.9)⁶⁷ to filter HGDP and 1KG variant data for designated genomic regions on chromosome 1, including the amylase SVR and flanking regions defined as bundle 0 and bundle 1 (distal and proximal, respectively) using the GRCh38 reference coordinate system (--region chromosome 1: 103,456,163–103,863,980 in GRCh38). The resulting output was saved in variant call format (vcf), keeping only biallelic SNPs (-m2 -M2 -v snps), and additionally filtered with vcftools (v.0.1.16)⁷¹ with --keep and --recode options for lists of individuals grouped by continental region in which we were able to estimate diploid copy numbers. Population-specific vcf files were further filtered for a minor allele frequency filter threshold of 5% (--minmaf 0.05) and used to generate a numeric genotype matrix with the physical positions of SNPs for linkage disequilibrium calculation (R² statistic) and plotting with the LDheatmap⁷² function in R (v4.2.2).

To further dissect the unique evolutionary history of the amylase locus, we compared regions with high R² across the SVR with linkage disequilibrium estimates for pairs of SNPs across regions of similar size in chromosome 1. We specifically focused on pairs of SNPs spanning bundle 0 (chromosome 1: 103456163–103561526 in GRCh38) and the first 66-kb of bundle 1, hereafter labelled as bundle 1a (chromosome 1: 103760698–103826698 in GRCh38), as revealed by the linkage disequilibrium heatmap. Then, we computed the R² values for any pair of SNPs in chromosome 1 for each superpopulation within a minimum of 190-kb distance (that is, the equivalent distance from the bundle 0 end to the bundle 1a start using the GRCh38 reference coordinate system) and maximum 370-kb distance (that is, the equivalent distance from the bundle 0 start to the bundle 1a end using the GRCh38 reference coordinate system). To calculate pairwise linkage disequilibrium across the human chromosome 1 for different populations, we ran plink (v1.90b6.21)⁷³ with options --r2 --ld-window 999999 --ld-window-kb 1000 --ld-window-r2 0 --make-bed --maf 0.05, using population-specific vcf files for a set of biallelic SNPs of 3,395 individuals from the HGDP and 1KG as input. As the resulting plink outputs only provide R² estimates for each pair of SNPs and respective SNP positions, we additionally calculated the physical distances between pairs of SNPs as the absolute difference between the base-pair position of the second (BP_B) and first (BP_A) SNP. We then filtered out distances smaller than 190 kb and greater than 370 kb, and annotated the genomic region for each R² value based on whether both SNPs fall across the SVR or elsewhere in chromosome 1. The distance between SNP pairs was also binned into intervals of 20,000 bp, and the midpoint of each interval was used for assessing linkage disequilibrium decay over genomic distances. 
The resulting dataset was imported in R to compute summary statistics comparing linkage disequilibrium across each major continental region, or superpopulations, and we used ggplot2 to visualize the results.
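The distance filter and binning step described above can be sketched as follows. This is an illustration rather than the pipeline's own code: the function name and the input tuples are hypothetical, and bin edges are assumed to fall on multiples of 20,000 bp.

```python
from collections import defaultdict
from statistics import mean

MIN_DIST = 190_000  # bp, bundle 0 end to bundle 1a start
MAX_DIST = 370_000  # bp, bundle 0 start to bundle 1a end
BIN_SIZE = 20_000   # bp


def bin_ld_pairs(pairs):
    """Summarize R^2 by distance bin.

    pairs: iterable of (bp_a, bp_b, r2), as in the BP_A, BP_B and R2
    columns of a plink .ld file. Returns {bin_midpoint: mean R^2}.
    """
    binned = defaultdict(list)
    for bp_a, bp_b, r2 in pairs:
        dist = abs(bp_b - bp_a)
        if dist < MIN_DIST or dist > MAX_DIST:
            continue  # keep only pairs separated by 190-370 kb
        bin_start = (dist // BIN_SIZE) * BIN_SIZE  # assumed bin edges
        binned[bin_start + BIN_SIZE // 2].append(r2)
    return {mid: mean(vals) for mid, vals in sorted(binned.items())}


# Made-up pairs for illustration only.
example = [
    (103_456_163, 103_760_698, 0.80),  # 304,535 bp apart
    (1_000, 200_000, 0.10),            # 199,000 bp apart
    (1_000, 50_000, 0.90),             # too close, dropped
    (0, 205_000, 0.25),
    (0, 210_000, 0.75),
]
decay = bin_ld_pairs(example)
```

Annotating each pair as SVR-spanning or background would be an additional column on the same records.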

Coalescent tree, ancestral-state reconstruction and PCA

To construct the coalescent tree, we first extracted bundle 0 and bundle 1a sequences from all 94 haplotypes (that is, distal and proximal unique regions flanking the amylase SVR) that went through principal bundle decomposition. On the basis of their coordinates on the human reference genome (GRCh38), we used SAMtools (v1.17)⁷⁴ to extract these sequences from three Neanderthal and one Denisovan genomes that are aligned to GRCh38. We used kalign (v3.3.5)⁷⁵ to perform multiple sequence alignment on bundle 0 and bundle 1a sequences. We used IQ-TREE (v2.2.2.3)⁷⁶ to construct a maximum likelihood tree with Neanderthal and Denisova sequences as the outgroup, using an estimated 650 kyr human–Neanderthal split time for time calibration²⁷. We used ggtree (v3.6.2)⁷⁷ in R (v4.2.1) to visualize the tree and annotated each tip with its structural haplotype and amylase gene copy numbers. We used cafe (v5.0.0)⁷⁸ to infer the ancestral copy numbers of each of the three amylase genes along the time-calibrated coalescent tree (excluding the outgroups) and to estimate their duplication/deletion rates. The timing of each duplication/deletion event was estimated based on the beginning and end of the branch along which the amylase gene copy number had changed. We used ggtree and ggplot (v3.4.2) in R to visualize these results, and used Adobe Illustrator (v27.5) to create illustrations for several of the most notable duplication/deletion events⁷⁹.

Next, we performed a PCA combining 94 HPRC haplotype sequences with variant calls for 3,395 individuals from the HGDP and 1KG. We first aligned all 94 bundle 0 and 94 bundle 1a haplotype sequences to the human reference genome (GRCh38) using minimap2 (v2.26)⁶⁴, and called SNPs from haplotypes using paftools.js. Each haplotype sequence appears as a pseudo-diploid in the resulting vcf file (that is, when the genotype is different from the reference, it is coded as being homozygous for the non-reference allele). These haplotype-specific vcf files were merged together and filtered for biallelic SNPs (-m2 -M2 -v snps) with bcftools, resulting in a pseudo-diploid vcf file from 94 haplotype sequences for each bundle. These were then merged with the respective bundle 0 and bundle 1a vcf files from the HGDP and 1KG, also filtered for biallelic SNPs, using bcftools. Finally, we ran plink with a minor allele frequency of 5% (--maf 0.05) to obtain eigenvalues and eigenvectors for PCA and used ggplot (v3.4.2) to visualize the results. These analyses were conducted with bundle 0 and bundle 1a separately, with highly concordant results (Supplementary Figs. 3 and 4). Analyses focused on bundle 0 are mostly reported in the main text (Fig. 3 and Extended Data Fig. 6), whereas bundle 1a results are shown as extended data (Extended Data Fig. 4).

Signatures of recent positive selection in modern human populations

To investigate very recent or ongoing positive selection at the amylase locus in modern humans, we first looked for significant signatures of reduced genetic diversity across the non-duplicated regions adjacent to the SVR compared with chromosome 1 in different populations worldwide. This stems from the assumption that, given low SNP density across the SVR, the high levels of linkage disequilibrium found between pairs of SNPs spanning bundle 0 and bundle 1a indicate that SNPs in bundle 0 or bundle 1 can be used as proxies for the selective history of the linked complex structures of the SVR. We calculated nucleotide diversity (π) on sliding windows of 20,000 bp spanning GRCh38 chromosome 1 with vcftools using population-specific vcf files from the HGDP and 1KG filtered for a set of biallelic SNPs as input. Each window was annotated for the genomic region, namely, bundle 0, SVR and bundle 1a. All windows comprising the SVR were removed from the resulting output due to low SNP density. We then used ggplot2 in R to compare and visualize nucleotide diversity in the flanking regions of the amylase locus (that is, bundle 0 and bundle 1a) and the rest of chromosome 1 for each major continental region or super-population.
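For readers unfamiliar with windowed nucleotide diversity, the sketch below computes π from alternate-allele counts at biallelic SNPs. It is a simplified stand-in for the vcftools calculation: it assumes non-overlapping windows, no missing genotypes and a fixed window length as the denominator. All names are hypothetical.

```python
from collections import defaultdict

WINDOW = 20_000  # bp


def windowed_pi(sites, n_chrom, window=WINDOW):
    """Per-window nucleotide diversity.

    sites: iterable of (position, alt_allele_count) for biallelic SNPs,
           with 1-based positions.
    n_chrom: number of sampled chromosomes (2 x diploid individuals).
    Returns {window_start: pi}, where pi is the mean number of pairwise
    differences per base pair.
    """
    if n_chrom < 2:
        raise ValueError("need at least two chromosomes")
    n_pairs = n_chrom * (n_chrom - 1) / 2
    totals = defaultdict(float)
    for pos, alt in sites:
        differing_pairs = alt * (n_chrom - alt)
        start = ((pos - 1) // window) * window + 1
        totals[start] += differing_pairs / n_pairs
    return {start: total / window for start, total in sorted(totals.items())}


# Made-up example: three SNPs observed in four chromosomes.
pi = windowed_pi([(100, 2), (15_000, 1), (25_000, 3)], n_chrom=4)
```

Windows without any SNP are absent from the result; a full analysis would report them with π = 0.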

To identify both soft and hard selective sweeps at the flanking regions of the SVR, we computed several different extended haplotype homozygosity-based statistics and statistics based on distortions of the haplotype frequency spectrum (Supplementary Table 5). Vcf files from the HGDP and 1KG chromosomes 1–22 GRCh38 were filtered for biallelic SNPs and minor allele frequency of 0.05 for target populations with over 10 individuals to calculate iHS⁸⁰, nSL⁸¹ and XP-nSL⁸² as implemented in selscan (v2.0.2)⁸³ (see Supplementary Table 5 for a description of populations and selection statistics). Utah residents with Northern and Western European ancestry (CEU) and Yoruba (YRI) populations were also included to confirm the ability of the tests to consistently identify the LCT hard sweep in CEU and in relation to the amylase locus (Supplementary Table 5). Scores for these statistics were normalized using the genome-wide empirical background with selscan’s co-package norm (v1.3.0). This was also used to compute the fraction of the standardized absolute values > 2 for each statistic in non-overlapping 100-kb windows genome-wide⁸⁰. For XP-nSL statistics, modern rainforest hunter-gatherers in Africa and the pastoralists Yakut were used as reference populations, so that positive scores correspond to possible sweeps in the populations with traditionally agricultural diets. We also used lassip (v1.2.0)⁸⁴ to compute H12 and H2/H1 statistics⁸⁵ and saltiLASSI Λ⁸⁴ on sliding windows of 201 SNPs with intervals of 100 SNPs. SNP positions within the SVR were removed from the resulting outputs due to low SNP density. 
We then compared the average and distribution of all selection statistics across individual SNPs or windows located within bundle 0 and bundle 1a (labelled as ‘AMY region’) and located within chromosome 2: 135–138 Mb (labelled as the ‘LCT region’) with that of the rest of the genome using geom_stats() and geom_density() functions in ggplot2 (Supplementary Table 5 and Supplementary Figs. 6–18). We also used an outlier approach and focused on the top 0.05% of the test statistic across all windows genome-wide for modern populations of known subsistence, and considered estimates above this threshold to be strong signals of selection⁸⁰. To improve detection power, we computed Fisher’s exact score⁸⁶ from SNP ranks for the two selection statistics that were better able to identify signatures of selection at the AMY locus. Then, we investigated whether the scores computed from these statistics for SNPs located at the AMY locus were among the top 1% of Fisher’s exact scores estimated genome-wide (Supplementary Table 5 and Supplementary Fig. 18).
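The window-level summary and the outlier approach can be illustrated with a short sketch. It is not the code used with selscan's norm program; the function names are hypothetical and the scores are invented for the example.

```python
from collections import defaultdict


def extreme_fraction_by_window(scores, window=100_000, cutoff=2.0):
    """Fraction of SNPs with |normalized score| > cutoff per window.

    scores: iterable of (position, normalized_score) on one chromosome.
    Windows are non-overlapping and keyed by their start coordinate.
    """
    counts = defaultdict(lambda: [0, 0])  # start -> [n_extreme, n_total]
    for pos, score in scores:
        start = (pos // window) * window
        counts[start][1] += 1
        if abs(score) > cutoff:
            counts[start][0] += 1
    return {start: ext / tot for start, (ext, tot) in sorted(counts.items())}


def top_windows(fractions, top_fraction):
    """Window starts ranking in the top `top_fraction` of all windows."""
    ranked = sorted(fractions, key=fractions.get, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return ranked[:k]


# Invented scores for two windows.
scores = [
    (10, 2.5), (50_000, -0.3), (99_999, -2.1),
    (150_000, 1.0), (160_000, 2.2), (170_000, 0.1), (180_000, -1.5),
]
fractions = extreme_fraction_by_window(scores)
outliers = top_windows(fractions, top_fraction=0.5)
```

With genome-wide data, `top_fraction=0.0005` would correspond to the top 0.05% threshold mentioned in the text.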

Inference of recent positive selection in West Eurasian populations using ancient genomes

To determine whether changes in the frequency of different structural haplotypes over the past 12,000 years were consistent with positive selection, we first grouped amylase structural haplotypes (n = 11) into those with the ancestral number of amylase gene copies (three total) or with amylase gene duplications (five or more copies). We used three complementary approaches to infer the selection coefficient associated with duplication-containing haplotypes. First, we used ApproxWF³¹ to perform Bayesian inference of the selection coefficient from binned allele frequency trajectories. We ran ApproxWF for 1,010,000 Markov chain Monte Carlo (MCMC) steps with parameters n = 10,000, h = 0.5 and pi = 1. We assumed a generation time of 30 years to convert the age of ancient samples from years to generations. The first 10,000 steps of the MCMC process were discarded in all analyses. Next, we used bmws (v0.1.0)³² to estimate the allele frequency trajectory and time-varying selection from genotype data with parameters -d diploid -l 4.5 -g 30 -n 10000 -t. We further ran 1,000 bootstrap replicates to obtain 95% credible intervals around our estimates. Last, we used an approximate Bayesian computation approach adapted and modified from ref. ³³ to explicitly account for the demographic processes underlying the allele frequency changes. We performed extensive forward-in-time simulations using SLiM (v3.7.1)⁸⁷ based on a well-established demographic model for West Eurasians³⁸ that includes major population split and admixture events as well as population growth (Supplementary Table 11). We allowed three model parameters to vary across simulations: selection coefficient (s), the time of selection onset (t, in kyr bp) and the initial allele frequency in the ancestral population (f). Selection is only applied to known agricultural populations (that is, early farmers, Neolithic farmers, and Bronze Age to present-day Europeans), and its strength is assumed to be constant over time. 
These parameter values were set in evenly spaced intervals (that is, 21 values of s ∈ [−0.01, 0.04], 21 values of t ∈ [3, 15], 31 values of f ∈ [0.05, 0.8]), and 1,000 replicate simulations were run for each unique parameter combination. This resulted in 13,671,000 simulations in total. For each simulation, we calculated the difference between the observed and the expected binned allele frequency trajectories, accounting for uneven sampling in time and genetic ancestry. We then selected the top 0.1% of simulations (that is, 13,671 simulations) that best resemble the observed data to approximate the posterior distribution of model parameters. We also examined the allele frequency changes (that is, the difference between allele frequencies in the first and last time bin) across all neutral simulations with s = 0 and compared them with the observed allele frequency change in the data (Supplementary Fig. 25).
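The size of the parameter grid and the rejection step can be reproduced with a minimal sketch. The grid values follow the text; the distance function is an assumed placeholder (a plain sum of squared differences) and does not include the weighting for uneven sampling in time and ancestry that the analysis describes. All names are hypothetical.

```python
from itertools import product


def linspace(lo, hi, n):
    """n evenly spaced values from lo to hi inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


S_VALUES = linspace(-0.01, 0.04, 21)  # selection coefficient s
T_VALUES = linspace(3.0, 15.0, 21)    # selection onset t, kyr bp
F_VALUES = linspace(0.05, 0.80, 31)   # initial allele frequency f
REPLICATES = 1_000

grid = list(product(S_VALUES, T_VALUES, F_VALUES))
n_simulations = len(grid) * REPLICATES  # 13,671 combinations x 1,000


def trajectory_distance(observed, simulated):
    """Assumed distance: sum of squared differences across time bins."""
    return sum((o - s) ** 2 for o, s in zip(observed, simulated))


def abc_reject(simulations, observed, keep_fraction=0.001):
    """Keep the parameters of the simulations closest to the data.

    simulations: list of (params, binned_trajectory) tuples.
    """
    ranked = sorted(
        simulations, key=lambda sim: trajectory_distance(observed, sim[1])
    )
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return [params for params, _ in ranked[:n_keep]]
```

The retained parameter tuples approximate the posterior; with 13,671,000 simulations and `keep_fraction=0.001`, 13,671 are kept, matching the count given above.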

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Emx2 underlies the development and evolution of marsupial gliding membranes

The P. breviceps sugar glider breeding colony

All experiments performed were approved by the IACUC committee at Princeton University. Captive-born, adult sugar gliders were obtained from the US pet trade and thereafter maintained in a breeding colony at Princeton University. The colony is kept under a 12 h–12 h light–dark cycle (temperature, 20–27 °C; humidity, 30–70%). Animals are fed daily a diet consisting of dried food, fruits and protein. Animals are housed in breeding pairs or trios. Female sugar gliders are inspected daily for pouch young by gently palpating the maternal pouch. Pouch young discovered by palpation are visually examined by briefly anaesthetizing the mother with isoflurane and gently everting the pouch to expose the neonate. For tissue collection, joeys are gently detached from the nipple, euthanized and processed in the laboratory. Both male and female joeys were used in all experiments. More details on our breeding colony can be found elsewhere¹⁴.

Scanning electron microscopy analysis

Sugar glider joeys were fixed and stored in 2% glutaraldehyde at 4 °C. After making several small incisions in the abdomen to increase infiltration, we treated the embryos with 1% osmium tetroxide for 90 min at room temperature followed by critical point drying. We sputter-coated all specimens with a 1-Å-thick coating of palladium and then imaged them on a SU3500 scanning electron microscope under a high vacuum.

Sample acquisition, genome sequencing and genome assemblies

The following samples were obtained from the ABTC collection of the South Australian Museum, following all established protocols and guidelines from museum authorities: A. pygmaeus (ABTC 77168, liver); D. trivirgata (ABTC 49304, kidney); D. pennatus (ABTC 44094, liver); P. volans (ABTC 83627, kidney); P. peregrinus (ABTC 138055, muscle); Pseudocheirus occidentalis (ABTC 7808, kidney); Pseudochirops cupreus (ABTC 45036, kidney); Pseudochirops corinnae (ABTC 49246, kidney); T. rostratus (ABTC 7742, kidney). A kidney sample from a female sugar glider, P. breviceps, was obtained from the Princeton University breeding colony. The samples were processed for genome sequencing and assembly using a combination of Illumina short-read sequencing and Hi-C for scaffolding. The full pipeline for sequencing and assembly used here has been described in detail previously¹⁸,⁴⁹. Using this approach, we were able to generate genome assemblies for 13 out of the 15 species. For D. trivirgata and T. rostratus, we were not able to generate a Hi-C assembly owing to poor sample quality. Thus, after generating short-read sequencing data for these two species, we performed contig assembly using MEGAHIT (v.1.1.4.2)⁵⁰ and used Redundans (v.0.14a)⁵¹ for haplotype purging, preliminary scaffolding, and gap closing. Lastly, we scaffolded against the Hi-C sugar glider genome with ragtag v.2.0.1 (parameters: -g 50 -m 10000000 -f 200)⁵². This approach is particularly suitable in our case, given the short phylogenetic divergence among petauroids. The representation of mammalian universal single-copy orthologues in the different genomes was assessed with BUSCO (v.5.4.4)²⁷ using the curated mammalian v.10 database¹⁹. The assembly metrics for all genome assemblies in this study are reported in Supplementary Table 1.

ATAC–seq analysis

ATAC–seq was performed on live single-cell suspensions from the patagium primordium, which were prepared as follows: tissue was dissected from P5 sugar glider joeys, placed into 1× PBS, transferred into 0.2% dispase/RPMI and incubated at 37 °C for 1 h. After this incubation, the epidermis and dermis were separated using forceps. The dermis was washed once in 1× PBS, transferred to 0.2% collagenase/RPMI and incubated at 37 °C for 2 h. The resulting dermal cell suspension was passed through a 40 μm mesh filter, and the remaining enzyme was washed by adding 30 ml of 0.2% bovine serum albumin (BSA)/1× PBS and centrifuging the cells for 10 min at 500 rcf (4 °C). The supernatant was discarded, and the pellet was resuspended in 0.2% BSA/1× PBS and filtered again into a 1.5 ml Eppendorf tube. After determining cell viability using Trypan Blue (Sigma-Aldrich), a total of 100,000 live cells per sample (n = 2) was set aside and used for library preparation.

Library preparation was performed according to the Omni-ATAC method⁵³. In brief, cells were lysed for 3 min on ice. The transposition reaction on the permeabilized nuclei was performed using TDE1 transposase (Illumina) at 37 °C for 60 min, and then purified using the Zymo DNA Clean and Concentrator-5 kit (Zymo). Illumina sequencing adapters and barcodes were added to the transposed DNA fragments by PCR amplification. The purified ATAC–seq library products were examined on Agilent Bioanalyzer DNA High Sensitivity chips for size distribution, quantified using the Qubit fluorometer (Invitrogen) and pooled at equal molar amounts. The ATAC–seq library pool was sequenced on the Illumina NovaSeq 6000 S Prime flowcell as paired-end 61 nucleotide reads. We generated 62,387,513 and 59,728,230 reads for libraries 1 and 2, respectively. Raw sequencing reads were filtered by the NovaSeq control software and only the pass-filter reads were used for further analysis. Raw ATAC–seq reads were trimmed using NGmerge v.0.2_dev and mapped to the Hi-C P. breviceps assembly using Bowtie2 (v.2.4.2)⁵⁴. Alignments were further processed by removing duplicates using picard MarkDuplicates SNAPSHOT v.2.21.4 (http://broadinstitute.github.io/picard), and samtools (v.1.12)⁵⁵ was used to filter reads and convert files into BAM format. Peak calling on each biological replicate (n = 2) was conducted using MACS2 (v.2.2.7.1)⁵⁶ (parameters: --nomodel -q 0.05 --keep-dup all --shift -100 --extsize 200 -g 2456432000 --nolambda). To assess the concordance of peak calls between our replicates, we used the Irreproducible Discovery Rate IDR (v.2.0.4.2)⁵⁷ and only those peaks passing a false-discovery rate threshold of 0.05 were used. We then used BEDTools intersect (v.2.27.1)⁵⁸ to remove regions of the ATAC peak calls that directly overlapped annotated exons.

ChIP–seq analysis

ChIP assays were performed using flash-frozen samples from the patagium primordium of P5 sugar glider joeys. For each reaction, patagium primordia from 6–7 joeys were pooled into a single tube. Approximately 20 mg of tissue was cross-linked with 1% formaldehyde and chromatin was extracted and sheared. The following antibodies were used in the ChIP assays: anti-H3K27ac antibody (Abcam, ab4729, GR232896-1, 4 μg), anti-EMX2 antibody (Novus, NBP2-39052, 27711) and a negative control anti-IgG antibody (Millipore, 12-370, 297424, 4 μg). For H3K27Ac ChIP–seq, chromatin was divided into two experimental samples (each experimental sample consisting of pooled tissue from 6–7 joeys) and one control sample, and ChIP assays were performed in triplicate for each of the experimental samples and in duplicate for the negative control sample. In all cases, we used 7 µg of chromatin and 4 µg of antibody. ChIP DNA was then processed into three standard Illumina ChIP–seq libraries (two experimental and one control) and sequenced to generate the following number of paired end reads: 56,614,872 (control); 56,957,310 (experimental 1); and 70,392,940 (experimental 2). EMX2 ChIP–seq analysis was performed in a similar manner, except that only one experimental (pooled tissue from 6–7 joeys) and one control library were generated and sequenced to generate the following number of single-end reads: 43,909,990 (control) and 41,672,698 (experimental). Raw ChIP–seq reads were trimmed using NGmerge v.0.2_dev and mapped to the Hi-C P. breviceps assembly using Bowtie2 (v.2.4.2)⁵⁴. Alignments were further processed by removing duplicates using picard MarkDuplicates SNAPSHOT v.2.21.4, and samtools (v.1.12)⁵⁵ was used to filter reads and convert files into BAM format. Broad peaks on each biological replicate (n = 2) were called with MACS2 (v.2.2.7.1)⁵⁶ (parameters: --broad -f BAMPE -g 2456432000 -q 0.05 --nolambda), using the input as a background control. 
Broad peak calls could not be filtered using IDR and, instead, peaks between the two replicates were concatenated and overlapping peaks were merged using BEDTools merge (v.2.27.1)⁵⁸. To identify enriched motifs in our EMX2 ChIP–seq data, we scanned all called ChIP peaks using the Simple Enrichment Analysis (SEA) program⁵⁹ from the MEME suite (v.5.5.4)⁶⁰.

Computational analyses

Identifying GARs

To identify GARs, we used a statistical phylogenetic test for acceleration along a specific branch of a phylogeny, as implemented in PhyloP²⁵. PhyloP requires two inputs: (1) a species tree, typically estimated from genome-wide data; and (2) a multiple-sequence alignment for each genomic region to be tested for acceleration. To obtain a species tree, we aligned orthologous coding sequences from the seventeen species included in our analysis (Supplementary Tables 1 and 2) and extracted fourfold degenerate sites. We then used RAxML (v.8.2.12)⁶¹ (parameters: -f a -x 50217 -p 50217 -# 1000 -o Pcine,Vursi -m GTRGAMMA) and Phylofit (RPHAST suite v.1.6.9)⁶² to produce a guide species tree and mod file, respectively (Fig. 1c). In parallel, we generated a tree using the first and second codon positions and found that it was identical to that produced by fourfold degenerate sites (Supplementary Fig. 3). To obtain a set of candidate sequences (that is, candidate cis-regulatory elements) that we could then test for acceleration, we focused on overlapping peaks between our ATAC–seq and our ChIP–seq sugar glider datasets. Peaks were considered overlapping if they had at least 1 bp in common between both datasets. Using this approach, we identified 52,169 candidate cis-regulatory elements. The size distribution of the candidate cis-regulatory elements is shown in Supplementary Fig. 4.
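The 1-bp overlap criterion can be made concrete with a small sketch. It assumes BED-style 0-based, half-open coordinates and returns the shared segment of each overlapping pair; whether the candidate elements were defined as the intersection, the union or one of the original peaks is not stated in the text, so that choice is an assumption. The function name is hypothetical.

```python
def overlapping_regions(atac_peaks, chip_peaks):
    """Segments shared by ATAC-seq and ChIP-seq peaks.

    Peaks are (chrom, start, end) tuples in 0-based, half-open coordinates.
    A pair counts as overlapping when it shares at least 1 bp.
    The nested loop is quadratic per chromosome, which is fine for a
    demonstration but not for genome-scale peak sets.
    """
    chip_by_chrom = {}
    for chrom, start, end in chip_peaks:
        chip_by_chrom.setdefault(chrom, []).append((start, end))

    shared = []
    for chrom, a_start, a_end in atac_peaks:
        for c_start, c_end in chip_by_chrom.get(chrom, []):
            lo, hi = max(a_start, c_start), min(a_end, c_end)
            if hi - lo >= 1:
                shared.append((chrom, lo, hi))
    return sorted(shared)


# Invented coordinates for illustration.
atac = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 60)]
chip = [("chr1", 199, 250), ("chr1", 400, 500), ("chr2", 10, 55)]
candidates = overlapping_regions(atac, chip)
```

Note that the peaks ending and starting at position 400 are adjacent rather than overlapping under half-open coordinates, so they do not produce a candidate.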

To identify orthologues of the 52,169 sugar glider candidate cis-regulatory elements across the 15 other diprotodont genomes examined in our study, we used a comparative annotation approach. First, we annotated conserved coding genes in each species by lifting-over gene models from the high-quality RefSeq annotation of the koala genome to each other genome using LiftOff v.1.6.3 (parameters: -d 4). We then conducted a second lift-over of sugar glider candidate cis-regulatory elements to each other species using the same procedure but with the addition of a flanking sequence to improve candidate cis-regulatory element mappability and reduce the chances of multi-mapping (parameter: -flank 1). We next used synteny anchoring⁶³ to validate candidate orthologues of sugar glider candidate cis-regulatory elements. For each of the 15 non-sugar-glider marsupial genomes, we created a list of candidate cis-regulatory element orthologues and their flanking genes (excluding genes that were not annotated in the reference sugar glider genome). Then, for each candidate cis-regulatory element orthologue in each species, we compared the identities of their flanking genes to those in the sugar glider genome. We considered elements to be orthologues of their reference sugar glider candidate cis-regulatory element if (1) the first flanking gene, upstream or downstream, matched that in the sugar glider genome and (2) those flanking genes in the target species were no greater than four times the distance from the candidate cis-regulatory element measured in the sugar glider genome. Candidate cis-regulatory element orthologues that passed this synteny check were then extracted from their respective genomes using gffread (v.0.12.7)⁶⁴.
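One way to read the two synteny criteria is sketched below. The interpretation that a match on either side is sufficient is an assumption, as are the class name, function name and gene labels; the actual implementation may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Flank:
    gene: str      # name of the first flanking gene on one side
    distance: int  # bp between the element and that gene


def passes_synteny(
    ref_up: Optional[Flank],
    ref_down: Optional[Flank],
    tgt_up: Optional[Flank],
    tgt_down: Optional[Flank],
    max_ratio: float = 4.0,
) -> bool:
    """Check a candidate orthologue against the sugar glider reference.

    Passes when, on at least one side, the flanking gene matches the
    reference and lies no further than max_ratio times the reference
    distance from the element.
    """
    for ref, tgt in ((ref_up, tgt_up), (ref_down, tgt_down)):
        if ref is None or tgt is None:
            continue  # no annotated flanking gene on this side
        if ref.gene == tgt.gene and tgt.distance <= max_ratio * ref.distance:
            return True
    return False


# Hypothetical gene labels and distances.
ref_up, ref_down = Flank("GENE_A", 10_000), Flank("GENE_B", 50_000)
ok = passes_synteny(ref_up, ref_down, Flank("GENE_A", 35_000), Flank("GENE_C", 1_000))
too_far = passes_synteny(ref_up, ref_down, Flank("GENE_A", 45_000), Flank("GENE_C", 1_000))
```

Here the first candidate passes because GENE_A lies within four times the reference distance, and the second fails because it does not and the other side does not match.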

Candidate cis-regulatory element orthologues across all species were then combined into a multi-fasta file and aligned using MAFFT v.7.453 (parameters: --adjustdirectionaccurately --localpair --maxiterate 1000). As our downstream evolutionary analyses were based on nucleotide substitution rates, alignment filtering was designed to address two key considerations: first, that estimated substitution rates in alignment regions containing a large number of gaps may not be reliable; and, second, that filtering based on similarity risks biasing the results of analyses that are predicated on measuring sequence divergence. We therefore used a filtering approach similar to that implemented by a previous study⁶⁵ and the Avian Phylogenomics Project⁶⁶. First, we trimmed the flank region from each non-sugar-glider target species by removing alignment columns outside the bounds of the sugar glider reference candidate cis-regulatory element sequence. Next, gappy alignment ends were trimmed using a 20 bp sliding window until 75% of species used in our analyses had a window with less than or equal to 5 gaps (Ns). To mitigate internal gap columns, we first used trimAl v.1.4.rev15 (parameter: -gappyout) and then removed individual orthologue sequences that, after our gap filtering, still contained greater than 25% gaps or retained any individual gaps greater than 10% of the alignment length. Finally, we retained only those alignments that included sequences from at least five species, including at least one glider and non-gliding sister species and at least one species from each outgroup. GAR alignments are shown in Supplementary Data 6. We next used phyloP²⁵ from the RPHAST suite (v.1.6.9)⁶² (method=“LRT”, mode=“ACC”) to identify sequences exhibiting accelerated substitution rates, testing each of the three gliding species (P. breviceps, A. pygmaeus and P. volans) independently. PhyloP performs comparisons on a per-species basis. 
That is, it compares a given genomic region (ATAC/ChIP peaks in our case) to the average genome substitution rate in the corresponding species. Thus, each species will have its own set of accelerated regions. Once we had a set of accelerated elements for each species, the resulting lists were compared to each other to establish which of those showed overlap among the glider species.
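The per-sequence gap thresholds in the alignment filter can be sketched as below. This covers only the 25% total-gap rule, the 10% single-gap rule and the five-species minimum; end trimming, the trimAl step and the taxon-composition requirement are left out. Treating both "-" and "N" as gap characters is an assumption, and the function names are hypothetical.

```python
import re

GAP_CHARS = "-Nn"  # assumed gap symbols


def longest_gap_run(seq: str) -> int:
    """Length of the longest uninterrupted run of gap characters."""
    runs = re.findall(r"[-Nn]+", seq)
    return max((len(run) for run in runs), default=0)


def filter_alignment(alignment, max_gap_frac=0.25, max_run_frac=0.10,
                     min_species=5):
    """Drop gappy sequences, then drop the alignment if too few remain.

    alignment: dict mapping species name to its aligned sequence.
    Returns the filtered dict, or None when fewer than min_species
    sequences survive.
    """
    kept = {}
    for species, seq in alignment.items():
        length = len(seq)
        if length == 0:
            continue
        n_gaps = sum(seq.count(ch) for ch in GAP_CHARS)
        if n_gaps / length > max_gap_frac:
            continue  # more than 25% gaps overall
        if longest_gap_run(seq) / length > max_run_frac:
            continue  # a single gap longer than 10% of the alignment
        kept[species] = seq
    return kept if len(kept) >= min_species else None


# Toy 20-column alignment with placeholder species labels.
toy = {
    "sp1": "ACGTACGTACGTACGTACGT",
    "sp2": "ACGT-CGTACGTACGTAC-T",  # 10% gaps, longest run 1: kept
    "sp3": "ACG---GTACGTACGTACGT",  # one gap of 15%: dropped
    "sp4": "A-C-G-T-A-C-GTACGTAC",  # 30% gaps: dropped
    "sp5": "ACGTACGTACGTACGTACGA",
    "sp6": "ACGTACGTACGTACGTACGC",
    "sp7": "ACGTACGTACGTACGTACGG",
}
filtered = filter_alignment(toy)
```

In the toy alignment, five sequences survive, which is just enough to retain the alignment.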

The superfamily Petauroidea is part of the Diprotodontia, the largest extant order of marsupials that also includes the superfamily Phalangeroidea and the suborders Vombatiformes and Macropodiformes. Although Vombatiformes are an accepted outgroup, the phylogenetic relationships among the other three groups remain unclear¹³,⁶⁷,⁶⁸. In agreement with a previous study¹³, our analysis using whole-genome data, both from fourfold degenerate sites and first and second codon positions, supports the placing of Petauroidea + Macropodiformes as sister groups (Fig. 1c and Supplementary Fig. 3 (topology 1)). However, other studies have placed Petauroidea + Phalangeroidea (topology 2)⁶⁸ or Macropodiformes + Phalangeroidea (topology 3)⁶⁷ as sister groups. To verify that topology had no effect on our results, we conducted additional phyloP analyses forcing topologies 2 and 3. Our analysis recovered around 97–99% of the elements in each glider species, including all the ones around Emx2. Specifically, topologies 2 and 3 recovered, respectively, 99.4% and 97.4% of elements for A. pygmaeus; 99.6% and 97.1% of elements for P. breviceps; and 99.8% and 97.9% of elements for P. volans. Thus, while resolving the phylogenetic relationships among Diprotodonts falls outside the scope of our study, our analyses show that our results are not affected by differences in topology.

Gene enrichment analysis

To test for enrichment of accelerated sequences near genes differentially expressed in the patagium, we first associated candidate cis-regulatory elements with their putative target genes using contact domains determined from our Micro-C data. First, contact domains called with window sizes of 10, 25 and 50 kb were integrated using an approach described previously⁶⁹. In brief, contact domain calls were combined and filtered such that fully overlapping and fully nested contact domains across all call sets were merged, whereas single and partially overlapping contact domains were retained. For all genes, we grouped all transcripts, sorted them by length and selected the longest transcript as the representative for each gene. The TSS for each gene was determined based on the location of the first annotated exon. If this exon did not start with an ATG, it was considered the 5′ untranslated region, and the TSS was annotated as the site 1 bp directly upstream of the first exon. For exons that did begin with an ATG codon, the TSS was estimated to be approximately 1 kb upstream from the translation start site. For each remaining TAD, its overlap with annotated TSSs and a candidate cis-regulatory element was calculated using BEDTools (v.2.27.1)⁵⁸, using the intersect function with parameter -wo. We then assigned each candidate cis-regulatory element to the nearest gene TSS using BEDTools closest v.2.27.1, using the default settings, and created a table that included information on whether the sequences assigned to each gene were GARs. For this analysis, we merged the list of GARs from each of the three gliding species, and all of the gene–enhancer annotations were conducted based on sugar glider genome coordinates. Out of 24,495 genes annotated in the sugar glider genome, we tested 11,044. The remaining genes were either not assigned to contact domains or had no candidate cis-regulatory elements assigned to them. 
For each gene, we computed the one-tailed hypergeometric P value of observing a given number of GARs from the total number of candidate cis-regulatory elements near a gene, where the number of trials is the total number of GARs and candidate cis-regulatory elements in our dataset. This analysis yielded a total of 1,116 genes enriched for GARs.
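
A one-tailed hypergeometric P value of this kind can be sketched with the standard library alone. The parametrization below is an assumption made for illustration, since the text does not spell it out: the population is all candidate cis-regulatory elements in the dataset, the "successes" are the GARs among them, and the draw is the set of elements assigned to one gene. The function and argument names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n_near_gene, n_gars_total, n_elements_total):
    """P(X >= k) for a hypergeometric variable.

    Assumed meaning of the arguments (illustrative only):
      k                 GARs observed near the gene
      n_near_gene       candidate elements assigned to the gene
      n_gars_total      GARs in the whole dataset
      n_elements_total  candidate elements in the whole dataset
    """
    upper = min(n_near_gene, n_gars_total)
    denominator = comb(n_elements_total, n_near_gene)
    numerator = sum(
        comb(n_gars_total, i) * comb(n_elements_total - n_gars_total, n_near_gene - i)
        for i in range(k, upper + 1)
    )
    return numerator / denominator
```

With 10 elements of which 3 are GARs, a gene holding 2 elements, at least one of them a GAR, gives hypergeom_upper_tail(1, 2, 3, 10) = 24/45. Genes would then be called enriched below some P value threshold, which the text does not state.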

Our contact domain-based gene enrichment analysis represents a biologically meaningful approach because it relies on three-dimensional genomic interactions occurring when the patagium primordium is developing. However, previous studies have also used a distance-based strategy, in which genes are assigned to candidate cis-regulatory elements based on physical proximity along the chromosome⁶⁵. To compare the outcomes of using a contact domain-based and a distance-based method, we reanalysed our data using a distance-based approach. As described above, we computed the probability of observing a given number of accelerated candidate cis-regulatory elements (GARs) based on a binomial distribution, where the number of trials is the total number of candidate cis-regulatory elements assigned to a gene and the probability of success is the proportion of all accelerated candidate cis-regulatory elements (GARs) over the total number of candidate cis-regulatory elements. Among the 24,495 genes annotated in the sugar glider genome, we tested 15,178 (the remaining genes did not have any candidate cis-regulatory elements assigned to them) and found that 1,638 were enriched for GARs. Of the 1,638 genes enriched, 27 were upregulated in the developing patagium compared with both the dorsal and shoulder skin. Of these 27, 21 were recovered with our contact domain-based analysis, demonstrating that the two approaches yield comparable results.
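
The binomial calculation described in this paragraph can be written compactly. The sketch below assumes an upper-tail probability (observing at least k GARs), which the text implies but does not state explicitly; the function name binomial_upper_tail is hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p).

    n: candidate cis-regulatory elements assigned to the gene (trials)
    p: proportion of all candidate elements that are GARs (success probability)
    k: GARs observed for the gene
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
```

For example, a gene with 3 assigned elements, a genome-wide GAR proportion of 0.5 and 2 observed GARs gives binomial_upper_tail(2, 3, 0.5) = 0.5.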

Finally, to establish the extent to which our results were being biased by contact domain size, we examined the size distribution of our contact domains alongside the corresponding number of enhancers. Our analysis revealed a low correlation between contact domain size and the number of enhancers (R² = 0.05) (Supplementary Fig. 5a). The majority of contact domains had 25 or fewer enhancers and were between 100,000 and 1,000,000 bp in size (Supplementary Fig. 5b,c). Notably, the contact domains containing our genes of interest are no larger than other contact domains and do not contain the greatest number of enhancers. Moreover, the largest contact domain (8,000,000 bp) did not contain the highest number of enhancers, and the contact domain containing the highest number of enhancers (~4,000,000 bp and 449 enhancers) does not contain any of our 59 genes of interest. Together, this analysis supports the notion that longer contact domains do not contain more active enhancers.
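
The R² value quoted above summarizes how much of the variation in enhancer number is explained by contact domain size. A minimal sketch is given below, assuming that R² is the square of the Pearson correlation from a simple linear fit (the text does not say how it was computed); the function name is hypothetical.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length numeric lists.

    Assumes both lists have non-zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy**2 / (sxx * syy)
```

Here xs would hold contact domain sizes in bp and ys the number of enhancers in each domain; a value near 0.05 indicates that domain size explains very little of the variation in enhancer count.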

Conservation between petauroid marsupials

To examine whether Emx2-associated GARs are evolutionarily conserved elements, we analysed all marsupial genomes shown in Fig. 1c and generated conservation scores for genomic regions surrounding these elements. In brief, ~20 kb genomic regions surrounding Emx2-associated GAR orthologues were extracted from each species. These regions were then aligned with MAFFT v.7.453 (parameters: --genafpair, --ep 0 and --maxiterate 1000). Per-base conservation scores were then calculated using phyloP (parameters: --method LRT and --wig-scores). Conservation scores were then visualized against the reference sugar glider sequence in IGV as a heat map.

Conservation between marsupials, laboratory mice and humans

We assessed the conservation of Emx2-associated GARs in eutherian mammals using the laboratory mouse (mm10) and human (hg38) genomes. We used the UCSC genome browser blat tool^70,71 to identify orthologous regions between human, mouse and marsupial GARs. The P. breviceps ATAC peak sequences were used as queries. For 5 out of the 6 GARs (that is, GAR 41701, GAR 16519, GAR 51182, GAR 32020 and GAR 13585), there were no hits that had either a strong score or that were located on the same chromosome as Emx2, even though the syntenic relationships among eutherians and marsupials in the region surrounding Emx2 were conserved. For GAR 11730, located in the Emx2 promoter, we identified a clear eutherian orthologue. We then used blat to search for these same P. breviceps sequences in the genome of the grey short-tailed possum (M. domestica), a non-petauroid marsupial from the New World. We were able to identify high-confidence hits for all six GARs. We then used the multiz alignment conservation scores in human and mouse and found that, other than the promoter element, the rest of the GARs had either no or very sparse conservation. We also used both NCBI BLASTn and discontinuous mega-blast⁷² to identify orthologous sequences for our six GARs. We performed a search in both mouse and human databases using the default parameters. Subsequently, we selected the top hit from each species located on the same chromosome as Emx2. We then conducted a reciprocal best BLAST against our P. breviceps database. Our findings mirrored the results obtained through our blat analysis, as we could only confidently recover an orthologous element for GAR 11730.

Gene Ontology and KEGG pathway analysis

Genes were examined for enrichment of KEGG Pathway and GO Biological Process terms using the Enrichr web server⁷³. Protein–protein predicted associations were assessed using the STRING web server³⁸.

Transcription-factor-binding analysis

Transcription-factor-binding motif analysis was performed using the MEME suite (v.5.5.4)⁶⁰. The P. breviceps GAR 11730 sequence was tested for enrichment of motifs, with the D. trivirgata sequence as a control. We also conducted the inverse comparison to search for potential P. breviceps loss of motifs. Transcription factor motif scans across the orthologous GARs were performed using XSTREME²⁸. The sequence of each species was run using the default parameters. For indel analysis, aligned sequences were scanned by eye to identify P. breviceps-specific insertions and deletions. These regions, plus 5–10 bp flanking sequences, were then run through XSTREME, altering parameters to account for the length of the sequences used (-nmotifs 10 -minw 4 -maxw 30). Tomtom⁷⁴ was used to identify genes associated with the identified motifs.

To determine whether candidate genes were potentially regulated by Emx2, we scanned the contact domains containing our candidate genes and used BEDTools intersect v.2.27.1 to determine whether there was overlap between the ATAC peaks associated with each gene and the Emx2 ChIP–seq peaks.

For transcription factor motif conservation analysis, we first used FIMO³⁷ (MEME suite v.5.5.4) to identify sites matching the de novo identified sugar glider EMX2–binding motif, originally identified from the ChIP–seq experiment (ATTARCNV), with a P value of 0.01 or less. We then created a multiple-sequence alignment of the ATAC peaks from all of our species and looked for these identified sites. If they were present in at least half of the species, we considered the site conserved. If a site was not conserved in at least half of the species, we then assessed whether it displayed glider-specific conservation.
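
The "conserved in at least half of the species" rule can be illustrated with a simplified sketch. Note the substitution: the study used FIMO with a P value threshold, whereas the code below uses exact matching against the IUPAC consensus ATTARCNV as a simpler stand-in. The function names and the input layout (one ungapped 8-bp string per species at the aligned site) are assumptions made for illustration.

```python
import re

# Standard IUPAC nucleotide codes expressed as regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_regex(motif):
    """Compile an IUPAC consensus string into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))


def site_is_conserved(site_by_species, motif="ATTARCNV"):
    """True if the motif matches at this site in at least half of the species.

    site_by_species maps a species name to the ungapped sequence found at the
    aligned site position (hypothetical input layout).
    """
    pattern = motif_regex(motif)
    hits = sum(1 for seq in site_by_species.values() if pattern.fullmatch(seq.upper()))
    return hits * 2 >= len(site_by_species)
```

Sites that fail this test would then be examined separately for glider-specific conservation, as described in the text.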

scRNA-seq analysis of laboratory mouse data

An existing scRNA-seq dataset⁷⁵ from dorsal skin of mouse embryos at E12.5, E13.5, E14.5 and E15.5 was reprocessed using the Seurat package (v.4.3.0)⁷⁶. Libraries from each timepoint were filtered to contain only cells with >200 and <4,000 expressed genes and <5% mitochondrial gene counts. After the filtering steps, Seurat Objects generated for reads from each timepoint were merged, gene counts were normalized and variable features were identified for each timepoint. The objects were then integrated by selection of integration features and anchors (n = 30 dimensions) and the integrated object was scaled. Significant principal components (n = 30) were identified and used to generate a uniform manifold approximation and projection (UMAP) for dimensional reduction. User specified dimensions were used to define neighbours and clusters. Cell type annotation was performed as previously described⁷⁵.
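
The cell-level quality filter described above amounts to three strict thresholds. The study applied them in Seurat (R); the Python sketch below is only a language-neutral illustration of the same rule, and the dictionary keys n_genes and pct_mito are hypothetical.

```python
def passes_qc(n_genes, pct_mito, min_genes=200, max_genes=4000, max_pct_mito=5.0):
    """Keep cells with >200 and <4,000 expressed genes and <5% mitochondrial counts."""
    return min_genes < n_genes < max_genes and pct_mito < max_pct_mito


def filter_cells(cells):
    """Return only the cells (dicts with 'n_genes' and 'pct_mito') that pass QC."""
    return [cell for cell in cells if passes_qc(cell["n_genes"], cell["pct_mito"])]
```

All three comparisons are strict, so a cell with exactly 200 expressed genes or exactly 5% mitochondrial counts is excluded.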

Dermal fibroblasts identified at all timepoints in the integrated object were subset based on established markers (Lum, Pdgfra, Crabp1 and Lox). The integrated object was split and reclustered, after normalization of the gene counts and reselection of variable features, integration features and integration anchors. Significant principal components, UMAP generation and neighbour/cluster identification were performed as described above. After subclustering, the expression of Emx2 and 13 patagium upregulated transcription factors was examined in each cluster using violin plots. For transcription factors showing expression in the subcluster marked by Emx2 (cluster 2), co-expression with Emx2 was explored by plotting normalized counts of each gene with a blended matrix of normalized expression of the two genes.

Micro-C analysis

Micro-C libraries were prepared using the Dovetail Micro-C Kit, according to the protocol suggested by the manufacturer (Dovetail Genomics). In brief, single-cell suspensions of the patagium primordium of P5 sugar glider joeys were generated as described for ATAC–seq and frozen at −80 °C. A total of ~1,000,000 pooled primary cells, corresponding to approximately 4–5 joeys, was used to generate each library. After thawing the cells, chromatin was fixed using disuccinimidyl glutarate and formaldehyde. Cross-linked chromatin was then digested in situ with micrococcal nuclease (MNase). After digestion, cells were lysed with SDS and chromatin fragments were bound to chromatin capture beads. Chromatin ends were then repaired and ligated to a biotinylated bridge adapter followed by proximity ligation of adapter-containing ends. After proximity ligation, cross-links were reversed, associated proteins were degraded, DNA was purified and a sequencing library was generated using Illumina-compatible adapters. In total, three libraries were generated. Before PCR amplification, biotin-containing fragments were isolated using Streptavidin beads and the libraries were sequenced on the Illumina HiSeq X platform to generate 660,352,634 (2 × 150 bp) reads.

Raw Micro-C reads were analysed according to the Dovetail documentation (Dovetail Genomics). Reads from the three libraries were pooled and mapped to the Hi-C P. breviceps assembly using BWA (v.0.7.15-r1188)⁷⁷. Subsequently, pairtools v.0.3.0 was used to remove PCR duplicates, sort and record valid ligation events. Moreover, pairtools select v.0.3.0 was used to determine unique pairs (with the ‘(pair_type==“UU”) or (pair_type==“UR”) or (pair_type==“RU”) or (pair_type==“uu”) or (pair_type==“Uu”) or (pair_type==“uU”)’ option) and samtools (v.1.12)⁵⁵ was used to convert files into BAM format. The Hi-C matrix was produced using juicer tools (v.1.22.01)⁷⁸ and contact domains (at 10 kb, 25 kb and 50 kb) were called using juicer tools arrowhead (v.1.22.01) (parameters: -k KR --ignore-sparsity). We used Mustache (v.1.0.1)⁷⁹ to conduct loop calling (at 5 kb and 10 kb) on individual scaffolds (parameters: -d 250000 -st 0.7 -pt 0.05) and Higlass-manage (v.0.8.0)⁸⁰ to visualize the data. We used Cooler (v.0.8.5)⁸¹ to prepare files and generate cooler multi-resolution contact maps. Clodius (v.0.3.5) and Higlass-manage (v.0.8.0)⁸⁰ were used to format and ingest other files.

Generation of immortalized sugar glider fibroblasts

The method for generating immortalized dermal fibroblasts has been described in detail previously^75,82. In brief, a skin biopsy was obtained from the trunk region (encompassing dorsal and lateral skin) of a P10 sugar glider joey and fat/connective tissue was scraped away. The sample was digested overnight at 4 °C in a solution containing HBSS without Ca²⁺ and Mg²⁺ (Thermo Fisher Scientific), dispase (500 caseinolytic units per ml (Corning)), and an antibiotic/antimycotic solution (100 μg ml⁻¹ streptomycin, 100 IU ml⁻¹ penicillin and 250 ng ml⁻¹ amphotericin B (HyClone)). After removing and discarding the epidermis, the dermis was cut into small pieces and fibroblasts were expanded as described previously⁷⁵. To generate an immortalized dermal fibroblast cell line, cells were transduced with undiluted γ-retroviral pseudoparticles according to procedures described previously⁷⁵. Cells were verified as negative for mycoplasma by testing with the MycoAlert Mycoplasma Detection Assay kit (Lonza) according to the manufacturer’s instructions. We profiled our cell line using RNA-seq and found that it displays robust levels of Emx2 and other key genes that we previously identified as being expressed in the patagium tissue (such as Wnt5a, Tbx3, Tbx5, Hand3 and Osr1)¹⁴. Moreover, we did not detect expression of genes like Shh or Pax5, a result that is also consistent with our previous transcriptional analysis of patagium tissue¹⁴ (Supplementary Fig. 6 and Supplementary Data 7). These results suggest that our sugar glider cell line provides an adequate in vitro model to test hypotheses about the upstream and downstream regulation of Emx2.

Luciferase assays

GAR analysis

For each candidate GAR, we synthesized the sequence from the glider species in which the sequence was found to be accelerated as well as from the corresponding non-gliding sister species (Twist Biosciences). The size of the sequence was defined by the overlap between the ATAC and the H3K27Ac ChIP peak identified in the sugar glider tissue. GARs and corresponding orthologues were cloned either into the pGL4.23 Luciferase Enhancer Reporter Vector (Promega) (GARs 41701, 16519, 32020, 51182, 13585) or the pGL4.10 Luciferase Promoter Reporter Vector (Promega) (GAR 11730) using In-fusion cloning (Takara). All of the resulting constructs were verified by Sanger sequencing. Immortalized sugar glider cells were seeded at a density of 5,000 cells per well and, the next morning, were transfected with the experimental constructs by using Lipofectamine (Invitrogen) (300 nl Lipofectamine to 200 ng plasmid DNA per well). To control for transfection, a control Renilla reporter vector (Promega 4.73) was co-transfected into each well (20 ng). After 48 h, cells were collected and processed using the DualGlo Luciferase Assay System (Promega) according to the protocol guidelines, and luciferase production was measured using a SpectraMax L luminometer (Molecular Devices). Luciferase activity was normalized relative to Renilla activity.

Wnt5a–Emx2 interaction analysis

We synthesized a 241 bp fragment corresponding to the region immediately downstream of the sugar glider Wnt5a first exon and a second one, in which the three Emx2 binding sites in this sequence were replaced by G bases (that is, ATTA to GGGG) (Twist Biosciences). We used In-fusion cloning (Takara) to clone each of the fragments into the pGL4.10 Luciferase Promoter Reporter Vector (Promega), in front of the luciferase coding sequence. Co-transfection experiments were carried out with either a GFP expression plasmid (Addgene 11153) or an Emx2 expression plasmid (Origene, NM_010132). All of the resulting constructs were verified by Sanger sequencing. For co-transfection experiments, we used 300 nl Lipofectamine to 200 ng plasmid DNA per well (100 ng of Luciferase vector and 100 ng of co-transfection GFP/Emx2). Cell seeding and all downstream experiments and analyses were performed as described above.

Immunostaining

For sugar glider and mouse immunofluorescence and IHC analysis, tissue samples were collected and fixed in 4% paraformaldehyde (PFA) overnight, washed in 1× PBS and incubated in 30% sucrose at 4 °C overnight. The samples were then embedded in optimal cutting temperature (OCT) compound, flash-frozen and cryosectioned (16 μm thickness) using the Leica CM3050S Cryostat. The slides were kept at −80 °C until use. Antibody stains were performed on tissue sections using standard procedures. In brief, slides were washed with 1× PBS with 0.1% Tween-20 (PBT) and blocked with 1× PBT/3% BSA for 1 h. Rabbit anti-EMX2 (Novus NBP2-39052; 1:50), chicken anti-GFP (Novus Biologicals, NB100-1614, 1:200) or rabbit anti-KRT14 (BioLegend, 905301, 1:1,000) primary antibodies were diluted in 1× PBT/3% BSA and slides were incubated at 4 °C overnight. The next morning, the slides were washed several times with 1× PBT and incubated with secondary antibodies (Alexa-Fluor 488 (Thermo Fisher Scientific, ab150169, dilution 1:500); biotinylated goat anti-rabbit (Vector Labs, BP-9100-50; ready-to-use dilution)). The reactions were visualized with HRP–streptavidin and the AEC substrate kit (Vector Labs: 568SK4200) or Alexa-dye-conjugated secondary antibodies (Thermo Fisher Scientific). The slides were washed several times with 1× PBT and mounted for imaging. AEC images were taken on a Nikon NiE upright microscope and fluorescence images on the Nikon A1R confocal microscope. NIS Elements v5 (Nikon) was used to acquire microscopy images.

Statistics and reproducibility

Micrographs shown in Figs. 3b,c,g, 4a and 5a–c,e and Extended Data Figs. 5b, 7a,b and 8 constitute representative data from at least three biological samples.

In vitro shRNA experiments

The RNAi Consortium (TRC) mouse lentiviral library carries several Emx2 shRNA constructs⁸³. We aligned the laboratory mouse and sugar glider Emx2 coding sequence and designed five different sugar glider-specific shRNA constructs targeting the same regions that were previously used for targeting the laboratory mouse locus⁸³. Moreover, we used the scrambled control recommended by the RNAi Consortium⁸³ (a list of all sequences is provided in Supplementary Table 5). We cloned the different sequences into the LV-GFP plasmid³⁴, which is designed for RNU6-1 promoter-driven shRNA expression. Large-scale production of VSV-G pseudotyped lentivirus was performed using calcium phosphate transfections in HEK293FT cells and the helper plasmids PMD2.G and psPAX2 (Addgene plasmids 12259 and 12260, respectively), as described previously³⁴. For viral infections, we plated cells in six-well dishes at 300,000 cells per well, and incubated them overnight with unconcentrated lentiviruses in the presence of polybrene (100 μg ml⁻¹). The medium was replaced the next morning and cells were allowed to grow for 5 days. After this period, cells were sorted using fluorescence-activated cell sorting (FACS), grown for five additional days and sorted using FACS a second time. Once cells reached confluency, RNA was extracted using the Zymo Direct-zol Kit (Zymo Research). The ability of the different shRNA constructs to induce Emx2 downregulation was determined using qPCR, using the primers listed in Supplementary Table 5. When introduced into cultured sugar glider immortalized dermal fibroblasts, the different shRNAs reduced Emx2 mRNA levels to 27% (shEmx2-1), 48% (shEmx2-2), 5% (shEmx2-3), 47% (shEmx2-4) and 38% (shEmx2-5), compared to a scrambled control shRNA (shScram). We selected shEmx2-3 and the shScram control for downstream analyses.

In-pouch lentiviral transgenesis

To develop sugar glider transgenesis, we used the LV-GFP plasmid³⁴. For subsequent shRNA experiments, we used shEmx2-3 (the most effective of the constructs that we tested) and the shScram control. Large-scale production of lentiviruses was performed as described for in vitro work, except that lentiviruses were subjected to ultraconcentration according to established protocols^34,84. For injections, we anaesthetized female sugar gliders, gently exposed a P3 joey from inside the pouch and intradermally injected ~2.5 μl of concentrated virus into the interlimb region (one side only) using a 33-gauge needle. After injection, the joey was gently placed back inside the pouch, and the mother was monitored until fully awake from the effect of anaesthesia. Several days later (P6 for RNA-seq/qPCR analysis and P9 for histology/measurements), females were anaesthetized, joeys were collected by gently detaching their jaw from the nipple, placed in a temporary container and brought back to the laboratory, where they were euthanized and processed for downstream analysis.

Phenotypic measurements

After confirming GFP expression by visually inspecting the tissue under a dissecting scope, joeys were fixed, embedded, cryosectioned and stained with DAPI. We used Fiji v.2.1.0 to measure the area of the left and right patagium in at least three non-consecutive sections per sample (n = 4 (shEmx2-3) and n = 4 (shScram) samples) and calculated the control/experimental ratio. Each measurement was performed three independent times and the results were averaged. All counts were performed in a blinded manner (that is, the researcher performing the measurements was unaware of which images corresponded to which genotypes/experimental conditions). Statistical differences were established using a mixed effects model one-way ANOVA test.

Tissue collection for qPCR and RNA-seq analysis

We injected the lateral skin of P3 joeys with either the shEmx2-3 or shScram lentivirus. Joeys were euthanized at P6 and infected patagium tissue, as visualized using GFP, was dissected and preserved in RNAlater. RNA was extracted using the RNeasy fibrous tissue mini kit (Qiagen) according to the manufacturer’s protocol. For qPCR, we used n = 5 shEmx2-3 samples and n = 5 shScram samples. For RNA-seq, we used n = 5 (shEmx2-3) and n = 5 (shScram) samples.

qPCR analysis

We used the qScript cDNA SuperMix (Quanta BioSciences) to generate complementary DNA (cDNA) and then performed qPCR using PerfeCTa SYBR Green FastMix (Quanta BioSciences). We assayed gene expression in triplicate for each sample and normalized the data using the housekeeping gene Actb. Primers used for qPCR were designed using the sugar glider genome and are reported in Supplementary Table 5. We analysed data from all qPCR experiments using the comparative C_t method and established statistical significance of expression differences using two-tailed t-tests.
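
The comparative C_t calculation can be sketched as follows. The code assumes the usual 2^(−ΔΔC_t) form with perfect amplification efficiency, which the text does not state explicitly, and uses Actb as the reference gene as described. Function and argument names are hypothetical.

```python
from statistics import mean


def relative_expression(ct_target_sample, ct_ref_sample, ct_target_control, ct_ref_control):
    """Fold change of a target gene in a sample relative to a control.

    Each argument is a list of technical-replicate C_t values (for example,
    triplicates); the reference gene would be Actb. Assumes 100% PCR
    efficiency, i.e. fold change = 2 ** -(delta-delta C_t).
    """
    delta_sample = mean(ct_target_sample) - mean(ct_ref_sample)
    delta_control = mean(ct_target_control) - mean(ct_ref_control)
    return 2 ** -(delta_sample - delta_control)
```

For example, a target that is 7 cycles behind Actb in a knockdown sample and 5 cycles behind in the control has a ΔΔC_t of 2, corresponding to 25% of control expression.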

Bulk RNA-seq

Tissues

RNA was extracted as described above and RNA-seq libraries were prepped using the TruSeq RNA Library Prep kit v2 (Illumina) and sequenced on the NovaSeq 6000 system (2 × 65 bp, paired-end). Pairwise differential expression analyses between the transcriptomes of shEmx2-3 and shScram skin were performed using DESeq2 v.1.34.1 from BioConductor (https://bioconductor.org/)⁸⁵. Only genes differentially expressed with an adjusted P < 0.05 were considered.

Cells

Immortalized P. breviceps cells were grown in a six-well dish until confluent. RNA was collected using the TRIzol reagent according to the manufacturer’s protocol. FASTQ reads were trimmed using Trimmomatic v.0.39 and aligned to the P. breviceps genome using STAR v.2.7.9a⁸⁶. Counts were generated using featureCounts (v.2.0.1)⁸⁷ (featureCounts -p -t transcript -g transcript_id -O --minOverlap 10) and RPKM was calculated as (reads for a gene/(all reads for the sample/1,000,000)) × (1,000/length of gene).
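
The RPKM formula given above translates directly into code. The sketch below simply restates it; the function and argument names are hypothetical, and gene length is assumed to be in base pairs.

```python
def rpkm(gene_reads, total_reads, gene_length_bp):
    """Reads per kilobase of transcript per million mapped reads.

    Implements: reads for a gene / (all reads for the sample / 1,000,000)
                x (1,000 / length of gene).
    """
    per_million = total_reads / 1_000_000
    return gene_reads / per_million * (1_000 / gene_length_bp)
```

For example, 500 reads on a 2,000-bp gene in a sample of 10 million reads gives rpkm(500, 10_000_000, 2000) = 25.0.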

Whole-mount in situ hybridization analysis

The Emx2 mouse riboprobe was a generous gift from T. Capellini and has been previously described³³. Whole-mount in situ hybridization was performed using previously described protocols⁸⁸. In brief, mouse E11.5 and E13.5 (CD-1) embryos were post-fixed with 4% PFA in 1× PBS, washed with 1× PBS, treated with 20 μg ml⁻¹ proteinase in 1× PBT for 45 min and incubated overnight with the Emx2 riboprobe at 65 °C. The next morning, the probe was washed with MABT (Maleic acid, NaCl, Tween-20) and incubated overnight with secondary anti-DIG antibodies (1:2,000) diluted in MABT, 2% Boehringer Blocking Reagent and 20% heat-treated sheep serum. After washing several times with MABT, the signal was developed by incubating with NBT/BCIP. Once a signal had developed sufficiently, the reaction was stopped by washing several times with PBT and fixing in 4% PFA overnight. The embryos were visualized using a SMZ18 stereo microscope (Nikon). NIS Elements v5 (Nikon) was used to acquire microscopy images.

HCR in situ hybridization

The sugar glider Wnt5a coding region was used to generate 20 probe binding sequences for in situ hybridization chain reaction (HCR). HCR was performed using the standard protocol for fixed frozen tissue sections available from Molecular Instruments. For visualizing Wnt5a and EMX2 co-expression, sections hybridized with the Wnt5a probe were subsequently incubated with the EMX2 antibody, according to the procedure described for immunostainings.

Stereoseq analysis

We analysed Emx2 and Wnt5a co-expression in a previously generated spatial Stereo-seq dataset⁸⁹. After examining all available datasets, we chose samples from E14.5 embryos, as this timepoint exhibited the most significant expression of both Emx2 and Wnt5a. We analysed the E1S1 sample, as it contained both the highest mean expressions of Emx2 and Wnt5a, as well as the highest percentage of spots with non-zero expression of Emx2 and Wnt5a. All of the datasets were processed and analysed using Scanpy².

We analysed the brain and craniofacial regions separately. First, we extracted the x coordinates and y coordinates of the spatial spots from the entire tissue, which we denote as X and Y, respectively. We reflected $Y\mapsto -Y$ so that the y coordinates were positive. We defined the brain region as the collection of spots, B, that were originally annotated as brain and where Y > 425. The latter threshold was chosen to remove any spots that may have been mislabelled as the brain region due to cell type deconvolution in the original study.

From trial and error and visual inspection, we defined the craniofacial region as the collection of points, C, such that:

$$C=\{(X,Y):\ X\le 165\ \text{and}\ 425 < Y < 540.96+1.22X-132.39\ \text{or}\ 165 < X < 275\ \text{or}\ 425 < Y < 462-1.23(X-201.62)\}$$

To focus specifically on craniofacial gene expression, we further removed any points that were originally annotated as belonging to either the brain or meninges regions⁵¹. To determine the likely cellular identities of spots expressing Emx2, we performed spatially informed dimensionality reduction using SpaceFlow⁹⁰. The latent embedding produced by SpaceFlow was used instead of principal component analysis to generate spatially coherent clusters, that is, clusters where spots are grouped based on both gene expression similarity and spatial proximity. We then generated a k-nearest neighbours (k-NN) graph (k = 15) to characterize spot–spot similarity. Using the k-NN graph, we used the Leiden algorithm to generate spatially resolved clusters that balanced spatial proximity with gene expression similarity, setting the clustering resolution parameter to Resolution=1.0. After constructing spatial clusters, we used differential expression analysis to determine the potential cellular identities of Emx2- and Wnt5a-expressing regions.

We used Mann–Whitney U-tests to identify spatial DEGs for each spatial cluster obtained by SpaceFlow. Specifically, we defined two particular groups of clusters. First, we aggregated the clusters in which Emx2 and Wnt5a were co-expressed significantly into a ‘Emx2 on/Wnt5a on’ cluster. Second, we aggregated the clusters in which Wnt5a was expressed significantly but Emx2 was absent into a ‘Emx2 off/Wnt5a on’ cluster. We then performed two separate differential expression tests, one between the Emx2 on/Wnt5a on cluster and the other SpaceFlow clusters and another between the Emx2 off/Wnt5a on cluster and the other clusters to identify the DEGs specific to Emx2^highWnt5a^high regions, enabling us to determine the likely cellular identities of these regions.
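
The statistic behind a Mann–Whitney U-test can be illustrated with a direct pairwise count. The sketch below computes only the U statistic for one group, not the P value, and is not the implementation used in the study (which presumably relied on a library routine); the function name is hypothetical.

```python
def mann_whitney_u(group_a, group_b):
    """U statistic for group_a versus group_b.

    Counts the (a, b) pairs in which a > b, with ties contributing 0.5.
    Quadratic in the group sizes, so suitable only for small illustrations.
    """
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u
```

Here group_a would hold the expression values of one gene in the spots of an aggregated cluster (for example, 'Emx2 on/Wnt5a on') and group_b the values in all other spots. The two U values for a pair of groups always sum to the product of the group sizes.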

RNAscope

Frozen tissue sections (12 μm) of mouse embryos at E14.5 and E16.5 were used for RNA in situ hybridization using the RNAscope kit v2 (323100, Advanced Cell Diagnostics), according to the manufacturer’s instructions. The Mus musculus Wnt5a and Emx2 probes (316791 and 319001-C3, respectively; Advanced Cell Diagnostics) were used and labelled with OPAL 520 reagent (FP1487001KT, Akoya Biosciences) and TSA Vivid 650 dye (7536, Advanced Cell Diagnostics), respectively. For all of the experiments, slides were mounted using VECTASHIELD Antifade Mounting Medium with DAPI (H-1200-10; Vector Laboratories). Fluorescent images were captured with a Fluoview3000 laser-scanning microscope (Olympus) with UPLSAPO ×20/0.75 NA and ×40/1.25 NA Silicone objectives (Olympus).

Mouse transgenics

All mouse experiments performed were approved by the IACUC committee at Princeton University. All mouse strains were provided with food and water ad libitum and kept on a 12 h–12 h light–dark cycle, at a temperature of 20 °C and 60% humidity. Experiments were carried out in both males and females.

Emx2 overexpression

The Rosa^Emx2-GFP strain was a gift from D. Wu (NIH) and has been previously described⁴². The Emx2^cre strain was obtained from Riken (RBRC02272) and has been previously described⁴⁶. The Pdgfra^creERT2 strain was obtained from JAX (032770) and has been previously described⁴³. The FVB/N769Tg(tetO-Wnt5a)17Rva/J and B6.Cg Gt(ROSA)26Sortm1(rtTA*M2)Jae/J strains were obtained from JAX (022938 and 006965, respectively) and have been previously described^91,92. Both male and female mice were used in all experiments. For Emx2 induction, P30 mice were intraperitoneally injected with tamoxifen (100 μl; 20 mg ml⁻¹) for 5 consecutive days. Then, 7 days after the last injection, mice were intraperitoneally injected with EdU (200 μl; 10 nM) and euthanized 4 h later. For Wnt5a induction, P30 mice were placed on 1 mg ml⁻¹ doxycycline-containing water ad libitum for 7 days (doxycycline water was replaced every other day). Skin tissue was collected, fixed in 4% PFA at 4 °C overnight and embedded in OCT for cryosectioning.

Phenotypic measurements

Epidermal thickness was quantified using Fiji v.2.1.0 by measuring tissue stained with haematoxylin and eosin from the base of the epidermis to the top of the epidermis. Measurements were taken exclusively from interfollicular regions (10 measurements per dorsal skin section, 3 dorsal skin sections per sample, 4 samples per treatment). Cell density and cell proliferation were quantified on tissue sections by counting the number of DAPI⁺ cells per surface area (measurements were done on three dorsal skin sections per sample, four samples per treatment). Statistical significance was assessed using a general mixed-effects model one-way ANOVA test (fixed effect = treatment; random effect = individual/sample).

LacZ enhancer reporter assays

We performed all mouse enhancer reporter assays using enSERT methodology, a recently developed strategy that uses CRISPR–Cas9 technology for site-directed insertion of enhancer reporter transgenes into the Igs2 intergenic safe-harbour site^44,93. As constructs are integrated into a specific genomic site, this method allows for reproducible and efficient testing of enhancer-reporter activity. After the constructs were synthesized, as described for our luciferase assays, we cloned them into a previously described LacZ targeting vector (Addgene, 139098)⁹³ containing a minimal Shh promoter. The resulting vectors were injected into mouse zygotes (FVB/NCrl strain; Charles River) in pools of 4–5 constructs, together with Igs2 gRNA and Cas9 protein (IDT), followed by embryo transfer into pseudopregnant females, as described previously⁴⁴. Embryos were collected at E9.5 and E11.5. After collection, the embryos were screened for the corresponding GAR insertion using junction PCR and Sanger sequencing, and stained with X-gal according to previously established protocols⁴⁴. At least 2 transgenic embryos with integration at the Igs2 locus per construct were analysed.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


Emergence of fractal geometries in the evolution of a metabolic enzyme


Molecular cloning

The gene encoding CS from S. elongatus PCC 7942 was amplified from genomic DNA by PCR (Q5 High-Fidelity 2× Master Mix, New England Biolabs) and introduced into the pLIC expression vector³⁸ by Gibson cloning (Gibson Assembly Master Mix, New England Biolabs). All other extant and ancestral CS sequences were obtained as gene fragments from Twist Bioscience and introduced into the same expression vector by Gibson cloning. All CS sequences were tagged with a C-terminal polyhistidine tag for purification (tag sequence: LE-HHHHHH-Stop). For single-site mutants and deletions of the CS sequences, KLD enzyme mix (New England Biolabs) was used. Mutagenesis primers were designed with NEBaseChanger and used to PCR-amplify the vector encoding the gene that was to be changed. The resulting PCR products were added to the KLD enzyme mix and subsequently transformed. All cloned genes were verified by Sanger sequencing (Microsynth) before use in experiments.

The DNA sequences of all purified proteins and NCBI identifiers of all extant sequences are presented in Supplementary Table 2.

Protein purification

For heterologous overexpression, the vectors with the gene of interest were transformed into chemically competent Escherichia coli BL21 (DE3) cells. Transformed colonies were used to inoculate expression cultures (500 ml) made from LB medium supplemented with 12.5 g l^–1 lactose (Fisher Chemical). The cultures were incubated overnight at 30 °C and 200 r.p.m. Cells were collected by centrifugation (4,500g, 15 min, 4 °C), resuspended in buffer A (20 mM Tris, 300 mM NaCl and 20 mM imidazole, pH 8) and freshly supplemented with DNase I (3 units µl^–1, Applichem). The cells were disrupted using a Microfluidizer (Microfluidics) in 3 cycles at 15,000 psi and centrifuged to spin down cell debris and aggregates (30,000g, 30 min, 4 °C). The clarified lysate was loaded with a peristaltic pump (Hei-FLOW 06, Heidolph) on prepacked nickel-NTA columns (5 ml Nuvia IMAC Ni-Charged, Bio-Rad) that were pre-equilibrated with buffer A. The loaded column was first washed with buffer A for 7 column volumes and then with 10% (v/v) buffer B (20 mM Tris, 300 mM NaCl and 500 mM imidazole, pH 8) in buffer A for 7 column volumes. The bound protein was eluted with buffer B and then either buffer-exchanged with PD-10 desalting columns (Cytiva) into PBS or 20 mM Tris, 200 mM NaCl, pH 7.5, or further purified by size-exclusion chromatography (SEC). For SEC, the protein was injected on an ENrich SEC 650 column (Bio-Rad) with PBS as the running buffer using an NGC Chromatography System (Bio-Rad). The purity of the proteins was analysed by SDS–PAGE. After either buffer exchange or SEC, the purified proteins were flash-frozen with liquid nitrogen and stored at −20 °C before further use.

Phylogenetic analysis and ancestral sequence reconstruction

Amino acid sequences of 84 CS genes from Cyanobacteria and marine Gammaproteobacteria as the outgroup were collected from the NCBI Reference Sequence database and aligned using MUSCLE (v.3.8.31)³⁹. The maximum likelihood (ML) phylogeny was inferred from the multiple sequence alignment (MSA) using RAxML (v.8.2.10)⁴⁰. The LG substitution matrix⁴¹, chosen by automatic best-fit model selection, was used together with fixed base frequencies and a gamma model of rate heterogeneity. The robustness of the ML tree topology was assessed by inferring 100 non-parametric bootstrap trees with RAxML, from which Felsenstein’s and transfer bootstrap values were derived using BOOSTER (https://booster.pasteur.fr). Using PhyML (3.0)⁴², we also computed approximate likelihood-ratio test⁴³ values to statistically evaluate branch support in the phylogeny.

Based on the CS tree and the MSA, ancestral sequences were inferred using the codeML package within PAML (v.4.9)⁴⁴. To adjust for gaps and the different lengths of N termini in the CS sequences, their ancestral state was determined using parsimony inference in PAUP (4.0a) based on a binary version of the MSA (1 = amino acid, 0 = gap, no residue). The state assignment for each node in the tree (amino acid or gap) was then applied to the inferred ancestral sequences.

The initial reconstruction of the crucial amino acid substitution q18L was ambiguous in the ancestors ancB and ancA. We determined that this was the case because this L residue is present in the Planktothrix clade and S. elongatus, but not in the Cyanobium/Prochlorococcus clade. Therefore, the L residue was either gained once and then lost along the lineage to Cyanobium/Prochlorococcus or it was gained convergently twice in Planktothrix and S. elongatus. We therefore added the CS sequence from the cyanobacterium Prochlorothrix hollandica to the alignment, which has been stably inferred as a sister group to S. elongatus and the Cyanobium/Prochlorococcus clade by multiple studies^45,46,47. This sequence was previously omitted from analysis as its position on the tree could not be inferred with high support. We manually added a branch to the tree, placing P. hollandica as a sister group to S. elongatus and the Cyanobium/Prochlorococcus clade. Branch lengths were reoptimized using RAxML and the ancestral reconstruction was repeated using PAML. The results gave high support to the substitution q18L being found in both ancB and ancA (Extended Data Fig. 8b) because P. hollandica also contains the L at position 18. This made the hypothesis of one gain and a subsequent loss in the Cyanobium/Prochlorococcus clade much more probable compared with three independent gains in Planktothrix, P. hollandica and S. elongatus.

MP analysis

Measurements were performed on a OneMP mass photometer (Refeyn). Reusable silicone gaskets (CultureWell™, CW-50R-1.0, 50-3 mm diameter × 1 mm depth) were set up on a cleaned microscopic cover slip (1.5 H, 24 × 60 mm, Carl Roth) and mounted on the stage of the mass photometer using immersion oil (IMMOIL-F30CC, Olympus). The gasket was filled with 19 µl buffer (PBS or 20 mM Tris, 200 mM NaCl pH 7.5) to focus the instrument. Then, 1 µl of prediluted protein solution (1 µM) was added to the buffer droplet and thoroughly mixed. The final concentration of the proteins during measurement was 50 nM unless stated otherwise. Data were acquired for 60 s at 100 frames per s using AcquireMP (Refeyn, v.1.2.1). The resulting movies were processed and analysed using DiscoverMP (Refeyn, v.2.5.0). The identified protein complexes with corresponding molecular weight were plotted as histograms, and the individual oligomeric state populations appeared as peaks that were fitted by a Gaussian curve (implemented in DiscoverMP). All complexes within the respective Gaussian curve were used to calculate the fraction of CS subunits in each oligomeric state. The instrument was calibrated at least once during each measuring session using either a commercial standard (NativeMark unstained protein standard, Thermo Fisher) or a homemade calibration standard of a protein with known sizes of complexes.
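The conversion from counted complexes to the fraction of subunits in each oligomeric state can be sketched as follows. This is a minimal illustration, assuming that each complex is weighted by its number of subunits; the function name and the example counts are hypothetical and are not taken from DiscoverMP or from the study.

```python
def subunit_fractions(counts):
    """Return the fraction of all subunits found in each oligomeric state.

    counts maps oligomer size (subunits per complex) to the number of
    complexes assigned to that state (for example, within one Gaussian peak).
    Assumption: every complex contributes `size` subunits to the total.
    """
    total_subunits = sum(size * n for size, n in counts.items())
    if total_subunits == 0:
        raise ValueError("no complexes counted")
    return {size: size * n / total_subunits for size, n in counts.items()}


# Hypothetical counts for illustration only (not data from the study).
example_counts = {2: 300, 6: 100, 18: 50}
example_fractions = subunit_fractions(example_counts)
```

Weighting by subunit number matters because a single 18mer holds as many subunits as nine dimers, so raw particle counts understate the share of protein in large assemblies.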

For substrate titrations, the prediluted protein sample (2 µM) was incubated for 10 min with the respective substrate concentration. The same substrate concentration was also included in the buffer in the gasket that was used for focusing. For each substrate concentration, three separate measurements were performed. For pH titrations, the protein sample was diluted into the buffer with the corresponding pH value (20 mM Tris, 200 mM NaCl pH 7–9.5). The dilution factor was at least 200, including predilution and final dilution in the gasket. For each pH value, two separate measurements were performed.

Native mass spectrometry

The purified protein samples were buffer-exchanged into 200 mM ammonium acetate by using centrifugal filter devices (Amicon Ultra) and three successive rounds of concentration and dilution. The concentration of protein was determined by UV absorbance (NanoDrop spectrophotometer, Thermo Fisher) and diluted into aliquots at appropriate monomeric concentrations. Nanoelectrospray was carried out in positive-ion mode on a Q Exactive UHMR mass spectrometer (Thermo Fisher), using gold-coated capillaries prepared in-house and the application of a modest backing pressure (about 0.5 mbar). Sulfur hexafluoride was introduced into the collision ‘HCD’ cell to improve transmission, and the instrument was operated at a resolution of 6,250 (at 200 m/z), with ‘high detector optimization’, and a trapping pressure in the HCD cell set to 4. The rest of the parameters were optimized for each sample, with the following ranges: capillary voltage of 1.2–1.5 kV; capillary temperature of 100–250 °C; in-source trapping from −15 to −150 V; injection times of 50–100 ms; and 1–10 microscans. Mass spectra were deconvolved using UniDec⁴⁸.

Kinetic enzyme assays

For the CS kinetic assays, the colorimetric quantification of thiol groups was used based on 5,5′-dithiobis-(2-nitrobenzoic acid) (DTNB)^49,50. The spectrophotometric reactions were carried out in 50 mM Tris pH 7.5, 10 mM KCl, 0.1 mg ml^–1 DTNB and 25 nM protein concentration at 25 °C. To measure K_m values, one substrate was saturated and added to the reaction mix (1 mM oxaloacetate or 0.5 mM acetyl-CoA). The other substrate was varied in concentration and added last to start the reaction. For kinetic measurements at non-saturating substrate concentrations, the protein was diluted only immediately before the reaction start to prevent the disassembly of complexes and added last to the reaction mix. Reaction progress was followed by measuring the appearance of 2-nitro-5-thiobenzoate at 412 nm (extinction coefficient of 14,150 M⁻¹ cm⁻¹) in a plate reader (Infinite M Nano+, Tecan) using Tecan i-control (v.3.9.1). Data analysis and determination of Michaelis–Menten kinetic parameters were done using GraphPad Prism (v.8.4.3). For the kinetic assays with the cys4 variant, the protein was dialysed in a buffer with a glutathione redox system to induce the formation of disulfide bonds of the cysteine residues (50 mM Na₂HPO₄, 150 mM NaCl, 1 mM glutathione and 0.5 mM glutathione disulfide, pH 8). After overnight dialysis, part of the protein sample was used for kinetic assays. The remainder was reduced by incubation with 10 mM dithiothreitol for 3 h at 4 °C and again used for kinetic assays. To exclude additional effects by the treatment itself, the WT SeCS was handled accordingly (dialysis in redox buffer and reduction with dithiothreitol) and measured kinetically for comparison.
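As an illustration of how an absorbance slope at 412 nm translates into a reaction rate, the following sketch applies the Beer–Lambert law with an extinction coefficient of 14,150 M⁻¹ cm⁻¹ for 2-nitro-5-thiobenzoate. The 1 cm path length is an assumption (the effective path length in a plate reader depends on the well volume), and the function names are hypothetical. The actual fitting was done in GraphPad Prism, which this sketch does not reproduce.

```python
EPSILON_TNB = 14150.0  # M^-1 cm^-1 at 412 nm


def rate_from_slope(delta_a_per_min, path_cm=1.0):
    """Convert an A412 slope (absorbance units per min) to µM product per min.

    Assumption: path_cm is the optical path length; 1 cm is a placeholder.
    """
    return delta_a_per_min / (EPSILON_TNB * path_cm) * 1e6


def apparent_turnover(rate_um_per_min, enzyme_nm=25.0):
    """Apparent turnover (s^-1) for a given enzyme concentration in nM."""
    return rate_um_per_min * 1000.0 / enzyme_nm / 60.0


def michaelis_menten(substrate, vmax, km):
    """Michaelis–Menten rate law, v = Vmax * [S] / (Km + [S])."""
    return vmax * substrate / (km + substrate)
```

With these definitions, a slope of 0.1415 absorbance units per minute corresponds to about 10 µM product per minute at a 1 cm path length.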

Box counting

To quantify fractal scaling, we used a fixed grid scan. The images of the class averages of the 18mer and 54mer assemblies were overlaid with a non-overlapping regular grid (Adobe Illustrator, v.24.0.2). The squares that were needed to fill out the structure were manually counted. This process was repeated for nine different box sizes of the grid (85–17 px). The entire procedure was replicated for three separate grid orientations for both assemblies. Linear regression was performed using GraphPad Prism (v.8.4.3).
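The regression step can be sketched as an ordinary least-squares fit of log(box count) against log(1/box size), whose slope estimates the box-counting dimension. This is a minimal sketch under that assumption; the function name is hypothetical, and the counts in the example are synthetic values for an ideal Sierpiński-like scaling, not measurements from the class averages.

```python
import math


def box_counting_dimension(box_sizes, counts):
    """Estimate a box-counting dimension as the slope of
    log(N) versus log(1 / box size), using ordinary least squares."""
    if len(box_sizes) != len(counts) or len(box_sizes) < 2:
        raise ValueError("need at least two (box size, count) pairs")
    xs = [math.log(1.0 / s) for s in box_sizes]
    ys = [math.log(n) for n in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic counts: halving the box size triples the number of filled boxes.
synthetic_sizes = [8, 4, 2, 1]
synthetic_counts = [1, 3, 9, 27]
synthetic_dimension = box_counting_dimension(synthetic_sizes, synthetic_counts)
```

For the synthetic input the slope equals log 3 / log 2 (about 1.585), the dimension of an ideal Sierpiński triangle. Manually counted boxes on a finite image will scatter around whatever scaling the structure follows.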

Cultivation of S. elongatus and sample preparation for metabolomics analysis

S. elongatus PCC 7942 was genetically modified to harbour variants of CS by homologous recombination as previously described⁵¹. The standard vector pSyn_6 (Thermo Fisher Scientific) was used as the backbone. A homology cassette was constructed by amplification and extraction of the CS gene and 1,000 bp of the neighbouring homologous regions by PCR from genomic DNA of WT S. elongatus PCC 7942. These were introduced into the pSyn_6 vector that included a spectinomycin-resistance gene to select for transformants. The respective sequence changes of the CS were introduced into this vector (L18Q) to create the corresponding homology cassette. The constructed homology cassettes (WT, L18Q) were transformed into WT S. elongatus PCC 7942 and plated on BG11 plates with 10 µg ml^–1 spectinomycin for selection. Transformants were re-streaked on fresh BG11 plates with spectinomycin, and resulting colonies were analysed for successful integration through the extraction of genomic DNA. All strains were verified by PCR amplification of the introduced cassette (primers were designed to bind outside the introduced DNA region) and Sanger sequencing. All sequences of the homology cassettes are presented in Supplementary Table 3.

S. elongatus PCC 7942 cultures and genetically modified strains were grown in BG11 medium at 30 °C, 100 r.p.m., ambient CO₂ levels and alternating light conditions: 12 h of light (photon flux of 120 μmol m⁻² s⁻¹) and 12 h of darkness. Before the growth experiment, precultures were entrained for 5 days in the circadian conditions to synchronize cells. Then 3 main cultures (50 ml) were set up from 3 independent precultures and inoculated to an OD₇₅₀ of 0.025 or 0.05. Samples for metabolomics analysis were cultivated in specific flasks to facilitate the isolation of culture solution through a syringe valve, which led to slower growth behaviour compared with the standard flasks. The samples were taken at 6 different time points (days 3, 5 and 7) after a light and a dark period.

For recovery experiments under nitrogen deficiency, S. elongatus strains were grown in BG11 medium at full light to an OD₇₅₀ of 0.5 in triplicate. The cells were then shifted to medium without a nitrogen source. To do this, the cells were washed twice with BG11 without nitrate and then continuously cultivated in BG11 without nitrogen. The cells underwent chlorosis and fully bleached in the subsequent days. After 14 and 20 days, a serial dilution of the respective cultures was spotted on BG11 agar plates and incubated for 7 further days for recovery.

Sample preparation for metabolomics analysis

The culture volume (1 ml) was taken from the shaking flask through a syringe and immediately quenched in 1 ml 70% methanol that was precooled in a −80 °C freezer. The sample was mixed and centrifuged (10 min, −10 °C, 13,000g). The supernatant was removed and the pellet was stored at −80 °C until the endometabolome was extracted. At each time point, the cell number and size were measured for each culture using a Coulter counter (Multisizer 4e, Beckman Coulter). The respective biovolume for each cell pellet was then calculated and used to infer a normalized amount of extraction fluid for each sample (extraction fluid = 20,00 × biovolume). All steps of the metabolome extraction were performed on ice and with precooled (−20 °C) reagents. To extract the metabolites, the calculated amount of extraction fluid (50% (v/v) methanol, 50% (v/v) TE buffer pH 7.0) was added to the cell pellets together with the same amount of chloroform. The samples were vortexed and incubated for 2 h at 4 °C while shaking. The phases were then separated by centrifugation (10 min, −10 °C, 13,000g). The upper phase was extracted with a syringe and the same amount of chloroform added again. After mixing, the sample was centrifuged again (10 min, −10 °C, 13,000g) to remove residual cell fragments and pigments. The upper phase was isolated, added to LC–MS vials and stored at −20 °C until analysis.

Quantification of intracellular metabolites from S. elongatus by LC–MS/MS

Quantitative determination of acetyl-CoA and citrate was performed using LC–MS/MS. The chromatographic separation was performed on an Agilent Infinity II 1290 HPLC system (Agilent) using a Kinetex EVO C18 column (150 × 2.1 mm, 3 μm particle size, 100 Å pore size, Phenomenex) connected to a guard column of similar specificity (20 × 2.1 mm, 3 μm particle size, Phenomenex). For acetyl-CoA, a constant flow rate of 0.25 ml min^–1 with mobile phase A being 50 mM ammonium acetate in water at a pH of 8.1 and phase B being 100% methanol at 25 °C was used. The injection volume was 1 µl. The mobile phase profile consisted of the following steps and linear gradients: 0–0.5 min constant at 5% B; 0.5–6.5 min from 5 to 80% B; 6.5–7.5 min constant at 80% B; 7.5–7.6 min from 80 to 5% B; and 7.6 to 10 min constant at 5% B. An Agilent 6470 mass spectrometer (Agilent) was used in positive mode with an electrospray ionization (ESI) source and the following conditions: ESI spray voltage of 4,500 V; nozzle voltage of 1,500 V; sheath gas of 400 °C at 11 l min^–1; nebulizer pressure of 30 psi; and drying gas of 250 °C at 11 l min^–1. The target analyte was identified based on the two specific mass transitions (810.1 → 428 and 810.1 → 302.2) at a collision energy of 35 V and its retention time compared with standards.

For citrate, a constant flow rate of 0.2 ml min^–1 with mobile phase A being 0.1% formic acid in water and phase B being 0.1% formic acid in methanol at 25 °C was used. The injection volume was 10 µl. The mobile phase profile consisted of the following steps and linear gradients: 0–5 min constant at 0% B; 5–6 min from 0 to 100% B; 6–8 min constant at 100% B; 8–8.1 min from 100 to 0% B; and 8.1 to 12 min constant at 0% B. An Agilent 6495 ion funnel mass spectrometer (Agilent) was used in negative mode with an ESI source and the following conditions: ESI spray voltage of 2,000 V; nozzle voltage of 500 V; sheath gas of 260 °C at 10 l min^–1; nebulizer pressure of 35 psi; and drying gas of 100 °C at 13 l min^–1. The target analyte was identified based on the two specific mass transitions (191 → 111.1 and 191 → 85.1) at collision energies of 11 and 14 V, respectively, and its retention time compared with standards.

Chromatograms were integrated using MassHunter software (Agilent). Absolute concentrations were calculated based on an external calibration curve prepared in sample matrix.

Negative-stain EM

Carbon-coated copper grids (400 mesh) were hydrophilized by glow discharging (PELCO easiGlow, Ted Pella). Next, 5 µl of 450 nM protein suspensions were applied onto the hydrophilized grids and stained with 2% uranyl acetate after a short washing step with double-distilled H₂O. Samples were analysed using a JEOL JEM-2100 transmission electron microscope with an acceleration voltage of 120 kV. A 2k F214 FastScan CCD camera (TVIPS) was used for image acquisition. Alternatively, a JEOL JEM1400 TEM (operated at 80 kV) with a 4k TVIPS TemCam XF416 camera was used. For 2D class averaging, images were taken manually and processed with cisTEM⁵². The following numbers of particles were averaged: 1,491 particles for 18mers; 200 particles for 54mers; and 186 for 36mers. The 36mer and 54mer particles were isolated from an extended dataset, in which we specifically looked for larger assemblies. The exact percentage of complexes larger than 18mers was difficult to estimate because of very strong preferential orientation. Most particles seemed to have landed not on the face of the triangle but on one of its edges or even one of its tips (Extended Data Fig. 1). To obtain an estimate, another dataset of 150 micrographs without a bias towards larger assemblies was collected. All particles in these micrographs were manually counted, including the assemblies that were lying on their edge and appeared as rectangles. By measuring the edge length, we could assign them as either 36mers (30 nm) or 54mers (40 nm). The analysis revealed that under negative-stain TEM conditions (450 nM) approximately 92.8% of detected assemblies were identified as 18mers (1,773 particles), 3.5% as 36mers (66 particles) and 3.8% as 54mers (72 particles). Our estimate of the abundance should still be treated with caution and compared with our SAXS data, which showed that large complexes only start to become reasonably common above 25 µM protein concentration.
For the H369R variant of SeCS, a protein concentration of 450 nM was used and 136 particles were averaged to produce the 2D class average of the 18mer shown in Extended Data Fig. 6d.

Crystallography and structure determination

Crystallization was performed using the sitting-drop method at 20 °C in 250 nl drops (Crystal Gryphon, Art Robbins Instruments) consisting of equal parts of protein and precipitation solutions (Swissci 3 Lens Crystallisation Plate). Protein solutions of 250 µM were incubated with 5 mM acetyl-CoA for 10 min at room temperature to induce disassembly into hexamers. The crystallization condition was 0.1 M citrate pH 5.5 and 2.0 M ammonium sulfate. Before data collection, crystals were flash-frozen in liquid nitrogen using a cryo-solution that consisted of mother liquor supplemented with 20% (v/v) glycerol. Data were collected under cryogenic conditions at P13, Deutsches Elektronen-Synchrotron. Data were processed using XDS and scaled with XSCALE⁵³. All structures were determined by molecular replacement with PHASER⁵⁴, manually built in WinCOOT (v.0.9.6)⁵⁵ and refined with PHENIX (v.1.19.2)⁵⁶. The search model for the structure was the hexameric Δ2–6 variant. Images of the structure were generated using PyMOL (v.2.5.2).

Cryo-EM

For cryo-EM sample preparation, 4.5 µl of the protein sample (22.5 µM) was applied to glow-discharged Quantifoil 2/1 grids, blotted for 4 s with force 4 in a Vitrobot Mark III (Thermo Fisher) at 100% humidity and 4 °C, and plunge frozen in liquid ethane, cooled by liquid nitrogen. Cryo-EM data were acquired with a FEI Titan Krios transmission electron microscope (Thermo Fisher) using SerialEM software⁵⁷. Movie frames were recorded at a nominal magnification of ×29,000 using a K3 direct electron detector (Gatan). The total electron dose of about 55 electrons per Å² was distributed over 30 frames at a pixel size of 1.09 Å. Micrographs were recorded in a defocus range from −0.5 to −3.0 μm.

Image processing, classification and refinement

For the SeCS 18mer, all processing steps were carried out in cryoSPARC (v.3.2.0)⁵⁸. A total of 1,408 movies were aligned using the patch motion correction tool, and contrast transfer function (CTF) parameters were determined using the patch CTF tool. An initial set of 10,173 particles was acquired through several rounds of blob picking, 2D classification and template picking for training a Topaz convolutional neural network particle picking model⁵⁹. From all the corrected micrographs, 273,259 particles were extracted in a box size of 350 by 350 pixels at a pixel size of 1.09 Å using the Topaz extract tool together with the trained model. Overall, 224,041 particles were selected for the ab initio reconstruction after removing poor particles through 2D classification. The initial density map was then three-dimensionally (3D) classified and refined using the heterogeneous refinement tool, which resulted in three classes. The dominant class (56.7% particles) was subjected to another round of heterogeneous refinement, which led to two classes. A 3D non-uniform refinement of the main class (79.8% particles) imposing a C3 symmetry, followed by a local CTF refinement produced a final resolution of 3.93 Å (GSFSC = 0.143), which was used for model building. Local resolution of the density map was calculated with the local resolution estimation tool.

For the ∆2–6 sample, cryo-EM micrographs were processed on the fly using the Focus software package⁶⁰ if they passed the selection criteria (iciness < 1.05, drift 0.4 Å < x < 70 Å, defocus 0.5 µm < x < 5.5 µm, estimated CTF resolution < 6 Å). Micrograph frames were aligned using MotionCor2 (ref. ⁶¹) and the CTF for aligned frames was determined using GCTF⁶². From 5,419 acquired micrographs, 1,687,951 particles were picked using the Phosaurus neural network architecture from crYOLO⁶³. Particles were extracted with a pixel box size of 256 scaled down to 96 using RELION (v.3.1)⁶⁴ and underwent several rounds of reference-free 2D classification. Overall, 1,271,457 selected particles (∆2–6) were re-extracted with a box size of 256 and imported into cryoSPARC (v.2.3)⁵⁸. For each sample, ab initio models were generated and passed through heterogeneous classification and refinement. Selected particles were re-imported to RELION and underwent several rounds of refinement, CTF-refinement (estimation of anisotropic magnification, fit of per-micrograph defocus and astigmatism and beamtilt estimation) and Bayesian polishing⁶⁵. Final C1 refinement produced models with an estimated resolution of 3.1 Å for ∆2–6 (gold standard FSC analysis of two independent half-sets at the 0.143 cut-off). Local resolution and 3D FSC plots were calculated using RELION and the “Remote 3DFSC Processing Server” web interface⁶⁶, respectively.

For the H369R SeCS 54mer and 18mer, all processing steps were carried out in cryoSPARC (v.4.4.0). In total, 29,126 movies were aligned using the patch motion correction tool, and CTF parameters were determined using the patch CTF tool. Next, 8,583 micrographs of estimated CTF fit ≤ 3.5 Å were selected for subsequent analysis. A Topaz particle picking model was generated by running several rounds of Topaz train and Topaz extract from an initial set of 150 manually picked particles. A total of 95,268 particles were picked using the trained model and extracted in a box size of 1,200 by 1,200 pixels at a pixel size of 0.79 Å. The particles were downsampled to a pixel size of 1.58 Å before 2D classification. 2D classes corresponding to the SeCS 54mer were selected to reconstruct two density maps using the ab initio reconstruction tool. All the extracted particles were re-aligned and 3D classified by running the heterogeneous refinement tool using the density map corresponding to an intact 54mer as a reference. The 3D class (18.0% particles) was further refined by non-uniform refinement, which resulted in a final resolution of 5.91 Å (GSFSC = 0.143), which was used for model building. To reconstruct the mutant 18mer, 899,109 particles were picked using a 2D class corresponding to the 18mer as a template and extracted in a box size of 500 by 500 pixels at a pixel size of 0.79 Å. A total of 552,353 particles were selected from 2D classification to generate 3 initial maps. The major class was 3D classified and aligned, followed by a non-uniform refinement to produce the final 18mer density at 3.34 Å (GSFSC = 0.143). Local resolution of the density map was calculated with the local resolution estimation tool, and preferred orientation was assessed using the orientation diagnostics tool.

For 18meric SeCS, initial models were generated separately from their protein sequences using AlphaFold⁶⁷ and then fitted as rigid bodies into the density using UCSF Chimera. The model was manually rebuilt using WinCoot (v.0.9.6)⁵⁵. Non-crystallographic symmetry constraints were manually defined in PHENIX (v.1.19.2)⁵⁶ so that each monomer within one hexamer is linked to the two corresponding monomers in the other two hexamers (corresponding to a C3 symmetric refinement of the 18mer). For the Δ2–6 hexamer, a hexameric subunit was extracted from the 18mer model as a starting model for refinement. The model was first rigid-body fitted into the density and then manually refined in WinCoot (v.0.9.6)⁵⁵. Both models were subjected to real-space refinements against the respective density maps using phenix.real_space_refine implemented in PHENIX (v.1.19.2)⁵⁶. Images of the structures were generated using PyMOL (v.2.5.2). For the 54mer structure of SeCS H369R, we used the dimers extracted from the WT 18mer structure as our starting model and fitted them as rigid bodies into the density using UCSF Chimera. We then truncated all side chains using pdbtools within PHENIX (v.1.19.2)⁵⁶. The structure was then subjected to one round of real space refinement using default parameters in PHENIX. For the 18mer structure of SeCS H369R, we also used the 18mer SeCS structure as the starting model. Individual dimers were first fitted as rigid bodies into the density using UCSF Chimera. We then subjected the structure to one round of flexible fitting with default parameters, followed by refinement with default parameters using the Namdinator server⁶⁸. The model was then manually rebuilt in WinCoot (v.0.9.6)⁵⁵. In this model, we truncated the substrate lids (residues 220–312, which are not part of the fractal interface) in all chains owing to poorly resolved density in our map, which made it difficult not to introduce register errors during refinement.

SAXS data collection and analysis

SAXS experiments were carried out at the BM29 beamline at the ESRF⁶⁹ using a PILATUS3X 2M photon counting detector (DECTRIS) at a fixed distance of 2.827 m. Protein samples were prepared in 25 mM Tris-HCl buffer pH 7.5 and 200 mM NaCl as a dilution series. Buffer matching was achieved by dialysis and all measurements were carried out at 20 °C. The sample delivery and measurements were performed using a 1 mm diameter quartz capillary, which is part of the BioSAXS automated sample changer unit (Arinax). Before and after each sample measurement, the corresponding buffer was measured and averaged. A total of ten frames (one frame per second) were taken for each sample. All experiments were conducted with the following parameters: beam current of 200 mA; flux of 2.6 × 10¹² photons s^–1 at sample position; wavelength of 1 Å; and estimated beam size of 200 × 200 µm. Processing and analysis of collected SAXS data were performed using ScÅtter IV⁷⁰. The R_g was determined by Guinier approximation. Plotting of the SAXS profiles and Guinier regions used BioXTAS RAW⁷¹.
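The Guinier approximation relates the low-angle intensity to R_g through ln I(q) ≈ ln I(0) − (R_g² / 3) q², so R_g follows from the slope of a straight-line fit of ln I against q². The sketch below illustrates this with a plain least-squares fit. It is not the ScÅtter IV implementation, the function name is hypothetical, and the choice of the Guinier region (commonly q·R_g below roughly 1.3) is left to the caller.

```python
import math


def guinier_fit(q, intensity):
    """Fit ln I(q) = ln I0 - (Rg^2 / 3) * q^2 by least squares.

    q and intensity should already be restricted to the Guinier region.
    Returns (Rg, I0) in the units implied by q (Rg in Å if q is in Å^-1).
    """
    if len(q) != len(intensity) or len(q) < 2:
        raise ValueError("need at least two (q, I) pairs")
    xs = [value * value for value in q]
    ys = [math.log(value) for value in intensity]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    if slope >= 0:
        raise ValueError("non-negative slope: data are not Guinier-like")
    intercept = mean_y - slope * mean_x
    return math.sqrt(-3.0 * slope), math.exp(intercept)
```

A quick self-check is to generate I(q) = I0·exp(−R_g² q² / 3) for a known R_g and confirm that the fit returns it.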

Construction of atomic models of the 54mers using 18mers

We used the align, translate, and rotate commands within PyMOL (v.2.5.2) to model how a 54mer complex would assemble if the 4.0° and 4.2° dimer rotations and 60° dihedral angle between dimers that are observed in the 18mer structure were applied. The rotation was applied to the connecting dimers of the three 18mer-subcomplexes that built the 54mer. To do this, copies of the hexamers that constitute the 18mer were rotated by 120°, so as to overlay the corner dimers by edge dimers. Two 18mer copies were subsequently connected to the rotated corners by two steps of structural alignment, which placed the residues that should form the third interface 210 Å from each other (Extended Data Fig. 2g).

Calculation of R_g values

Calculation of R_g values was done using gmx gyrate from the GROMACS 2022.2 simulation package⁷² from the atomic models of the 6mer, 18mer and the 54mer.
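For reference, a mass-weighted radius of gyration of an atomic model is R_g = sqrt(Σ mᵢ |rᵢ − r_com|² / Σ mᵢ), where r_com is the centre of mass. The following pure-Python sketch shows that definition; it assumes mass weighting and is not a substitute for gmx gyrate. The function name and the toy coordinates are hypothetical.

```python
import math


def radius_of_gyration(coords, masses):
    """Mass-weighted radius of gyration for a list of (x, y, z) coordinates."""
    if len(coords) != len(masses) or not coords:
        raise ValueError("coords and masses must be non-empty and equal in length")
    total_mass = sum(masses)
    centre = [
        sum(m * c[axis] for c, m in zip(coords, masses)) / total_mass
        for axis in range(3)
    ]
    weighted_sq = sum(
        m * sum((c[axis] - centre[axis]) ** 2 for axis in range(3))
        for c, m in zip(coords, masses)
    )
    return math.sqrt(weighted_sq / total_mass)


# Toy example: two equal masses 2 Å apart have Rg = 1 Å.
toy_rg = radius_of_gyration([(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)], [12.0, 12.0])
```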

Displacement vectors, rotational axes and dihedral angles of atomic models

Symmetry axes were generated with AnAnaS⁷³, and rotation axes and angles were calculated using PyMOL (v.2.5.2) and a compatible script. Displacement vectors were drawn between Cα atoms of the aligned structures using the object argument and cgo-arrow. Dihedral angles between dimers across the fractal interface were calculated in PyMOL (v.2.5.2). The centre of mass of both dimers, as well as of one monomer from each dimer, was first calculated with the com command. The dihedral angle was then calculated using get_dihedral along the axis defined by the two centres of mass of the dimers.
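The dihedral geometry described above can be illustrated with four points: the centre of mass of one monomer, the centres of mass of the two dimers (which define the axis), and the centre of mass of a monomer from the second dimer. The sketch below computes the signed dihedral angle from such points in plain Python. It is an assumption-laden illustration of the geometry rather than the PyMOL procedure, and the function name is hypothetical.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1 -> p2 axis.

    p0: centre of mass of a monomer in dimer A; p1, p2: centres of mass of
    dimers A and B; p3: centre of mass of a monomer in dimer B.
    The sign depends on the handedness convention used here.
    """
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    axis_len = math.sqrt(_dot(b2, b2))
    if axis_len == 0:
        raise ValueError("p1 and p2 coincide, so the axis is undefined")
    unit_axis = (b2[0] / axis_len, b2[1] / axis_len, b2[2] / axis_len)
    m1 = _cross(n1, unit_axis)
    return math.degrees(math.atan2(_dot(m1, n2), _dot(n1, n2)))
```

Because sign conventions differ between tools, comparing absolute values is the safer check when reproducing an angle such as the 60° reported for the 18mer.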

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

The variation and evolution of complete human centromeres

Cell lines

CHM1hTERT (CHM1) cells were originally isolated from a hydatidiform mole at Magee-Womens Hospital. Cryogenically frozen cells from this culture were grown and transformed using human telomerase reverse transcriptase (hTERT) to immortalize the cell line. This cell line has been authenticated by short-tandem-repeat analysis by Cell Line Genetics and has tested negative for mycoplasma contamination. Human HG00733 lymphoblastoid cells were originally obtained from a female Puerto Rican child, immortalized with the Epstein–Barr Virus (EBV) and stored at the Coriell Institute for Medical Research. This cell line has been authenticated using a multiplex PCR assay with six autosomal microsatellite markers and has tested negative for mycoplasma contamination. Chimpanzee (Pan troglodytes, Clint, S006007) fibroblast cells were originally obtained from a male western chimpanzee named Clint (now deceased) at the Yerkes National Primate Research Center and immortalized with EBV. Orangutan (Pongo abelii, Susie, PR01109) fibroblast cells were originally obtained from a female Sumatran orangutan named Susie (now deceased) at the Gladys Porter Zoo, immortalized with EBV and stored at the Coriell Institute for Medical Research. Macaque (Macaca mulatta; AG07107) fibroblast cells were originally obtained from a female rhesus macaque of Indian origin and stored at the Coriell Institute for Medical Research. The chimpanzee, orangutan and macaque cell lines have not yet been authenticated or assessed for mycoplasma contamination to our knowledge.

Cell culture

CHM1 cells were cultured in complete AmnioMax C-100 Basal Medium (Thermo Fisher Scientific, 17001082) supplemented with 15% AmnioMax C-100 Supplement (Thermo Fisher Scientific, 12556015) and 1% penicillin–streptomycin (Thermo Fisher Scientific, 15140122). HG00733 (Homo sapiens) cells were cultured in RPMI-1640 medium (Sigma-Aldrich, R8758) supplemented with 15% fetal bovine serum (FBS; Thermo Fisher Scientific, 16000-044) and 1% penicillin–streptomycin (Thermo Fisher Scientific, 15140122). Chimpanzee (P. troglodytes; Clint; S006007) and macaque (Macaca mulatta; AG07107) cells were cultured in MEM α containing ribonucleosides, deoxyribonucleosides and l-glutamine (Thermo Fisher Scientific, 12571063) supplemented with 12% FBS (Thermo Fisher Scientific, 16000-044) and 1% penicillin–streptomycin (Thermo Fisher Scientific, 15140122). Orangutan (P. abelii; Susie; PR01109) cells were cultured in MEM α containing ribonucleosides, deoxyribonucleosides and l-glutamine (Thermo Fisher Scientific, 12571063) supplemented with 15% FBS (Thermo Fisher Scientific, 16000-044) and 1% penicillin–streptomycin (Thermo Fisher Scientific, 15140122). All cells were cultured in a humidity-controlled environment at 37 °C under 95% O₂.

DNA extraction, library preparation and sequencing

PacBio HiFi data were generated from the CHM1 and HG00733 genomes as previously described²¹ with some modifications. In brief, high-molecular-weight DNA was extracted from cells using a modified Qiagen Gentra Puregene Cell Kit protocol⁴⁷. High-molecular-weight DNA was used to generate PacBio HiFi libraries using the Template Prep Kit v1 (PacBio, 100-259-100) or SMRTbell Express Template Prep Kit v2 (PacBio, 100-938-900) and SMRTbell Enzyme Clean Up kits (PacBio, 101-746-400 and 101-932-600). Size selection was performed with SageELF (Sage Science, ELF001), and fractions sized 11 kb, 14 kb, 15 kb or 16 kb (as determined by FEMTO Pulse (Agilent, M5330AA)) were chosen for sequencing. Libraries were sequenced on the Sequel II platform with seven or eight SMRT Cells 8M (PacBio, 101-389-001) per sample using either Sequel II Sequencing Chemistry 1.0 (PacBio, 101-717-200) or 2.0 (PacBio, 101-820-200), both with 2 h pre-extension and 30 h movies, aiming for a minimum estimated coverage of 30× in PacBio HiFi reads (assuming a genome size of 3.1 Gb). Raw CHM1 data were processed using DeepConsensus⁴⁸ (v.0.2.0) with the default parameters. Raw HG00733 data were processed using the CCS algorithm (v.3.4.1) with the following parameters: --minPasses 3 --minPredictedAccuracy 0.99 --maxLength 21000 or 50000.

Ultra-long ONT data were generated from the CHM1, HG00733, chimpanzee, orangutan and macaque genomes according to a previously published protocol⁴⁹. In brief, 3–5 × 10⁷ cells were lysed in a buffer containing 10 mM Tris-Cl (pH 8.0), 0.1 M EDTA (pH 8.0), 0.5% (w/v) SDS and 20 μg ml⁻¹ RNase A (Qiagen, 19101) for 1 h at 37 °C. Then, 200 μg ml⁻¹ proteinase K (Qiagen, 19131) was added, and the solution was incubated at 50 °C for 2 h. DNA was purified through two rounds of 25:24:1 (v/v) phenol–chloroform–isoamyl alcohol extraction followed by ethanol precipitation. Precipitated DNA was solubilized in 10 mM Tris (pH 8.0) containing 0.02% Triton X-100 at 4 °C for 2 days. Libraries were constructed using the Ultra-Long DNA Sequencing Kit (ONT, SQK-ULK001) with modifications to the manufacturer’s protocol. Specifically, around 40 μg of DNA was mixed with FRA enzyme and FDB buffer as described in the protocol and incubated for 5 min at room temperature, followed by a 5 min heat-inactivation at 75 °C. RAP enzyme was mixed with the DNA solution and incubated at room temperature for 1 h before the clean-up step. Clean-up was performed using the Nanobind UL Library Prep Kit (Circulomics, NB-900-601-01) and eluted in 225 μl EB. Then, 75 μl of library was loaded onto a primed FLO-PRO002 R9.4.1 flow cell for sequencing on the PromethION, with two nuclease washes and reloads after 24 and 48 h of sequencing.

Additional ONT data were generated from the CHM1, HG00733, chimpanzee, orangutan and macaque genomes according to a previously published protocol²¹. In brief, high-molecular-weight DNA was extracted from cells using a modified Qiagen Gentra Puregene protocol⁴⁷. High-molecular-weight DNA was prepared into libraries with the Ligation Sequencing Kit (SQK-LSK110) from ONT and loaded onto primed FLO-PRO002 R9.4.1 flow cells for sequencing on the PromethION system, with two nuclease washes and reloads after 24 and 48 h of sequencing. All ONT data were base-called using Guppy (v.5.0.11) with the SUP model.

Targeted sequence assembly and validation of centromeric regions

To generate complete assemblies of centromeric regions from the CHM1, HG00733, chimpanzee, orangutan and macaque genomes, we first assembled each genome from PacBio HiFi data (Supplementary Table 1) using hifiasm²⁴ (v.0.16.1). The resulting PacBio HiFi contigs were aligned to the T2T-CHM13 reference genome⁴ (v.2.0) using minimap2⁵⁰ (v.2.24) with the following parameters: -I 15G -a --eqx -x asm20 -s 5000. Fragmented centromeric contigs were subsequently scaffolded with ultra-long (>100 kb) ONT data generated from the same source genome using a method that takes advantage of singly unique nucleotide k-mers (SUNKs) (Supplementary Fig. 1; https://github.com/arozanski97/SUNK-based-contig-scaffolding). In brief, SUNKs (k = 20 bp) were identified from the CHM1 PacBio HiFi whole-genome assembly using Jellyfish (v.2.2.4) and barcoded on the CHM1 PacBio HiFi centromeric contigs as well as all ultra-long ONT reads. PacBio HiFi centromeric contigs sharing a SUNK barcode with ultra-long ONT reads were subsequently joined together to generate contiguous assemblies that traverse each centromeric region. The base accuracy of the assemblies was improved by replacing the ONT sequences with locally assembled PacBio HiFi contigs generated using HiCanu⁷ (v.2.1.1).

We validated the construction of each centromere assembly using four different methods. First, we aligned native PacBio HiFi and ONT data from the same source genome to each whole-genome assembly using pbmm2 (v.1.1.0) (for PacBio HiFi data; https://github.com/PacificBiosciences/pbmm2) or Winnowmap⁵¹ (v.1.0) (for ONT data) and assessed the assemblies for uniform read depth across the centromeric regions using IGV⁵² and NucFreq²². We next assessed the concordance between the assemblies and raw PacBio HiFi data using VerityMap²⁷, which identifies discordant k-mers between the two and flags them for correction. We then assessed the concordance between the assemblies and ONT data using GAVISUNK²⁸, which identifies concordant SUNKs between the two. Finally, we estimated the accuracy of the centromere assemblies from mapped k-mers (k = 21) using Merqury (v.1.1)⁵³ and publicly available Illumina data from each genome (Extended Data Table 1). We estimated the QV of the centromeric regions with the following formula:

$$-10\times \log \left(1-{\left(1-\frac{\text{number of erroneous } k\text{-mers}}{\text{total number of } k\text{-mers}}\right)}^{1/k}\right)$$
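
As a worked illustration of the formula, the sketch below converts k-mer counts into a QV. It assumes the logarithm is base 10, as is conventional for Phred-scaled quality values; the function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def kmer_qv(erroneous_kmers, total_kmers, k=21):
    """Phred-scaled QV from the fraction of erroneous k-mers.

    Assumes a base-10 logarithm (Phred convention).
    """
    if total_kmers <= 0:
        raise ValueError("total_kmers must be positive")
    if erroneous_kmers == 0:
        return math.inf  # no detectable errors at this k
    # Probability that a whole k-mer is error-free ...
    kmer_ok = 1.0 - erroneous_kmers / total_kmers
    # ... converted to a per-base error rate by taking the k-th root.
    per_base_error = 1.0 - kmer_ok ** (1.0 / k)
    return -10.0 * math.log10(per_base_error)


# Hypothetical counts: 210 erroneous 21-mers out of 100 million.
# The per-base error rate is close to 1e-7, giving a QV near 70.
qv = kmer_qv(210, 100_000_000, k=21)
print(f"QV = {qv:.2f}")
```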

FISH and spectral karyotyping

To determine the karyotype of the CHM1 genome, we first prepared metaphase chromosome spreads by arresting CHM1 cells in mitosis via the addition of KaryoMAX Colcemid Solution (0.1 µg ml⁻¹, Thermo Fisher Scientific, 15212012) to the growth medium for 6 h. Cells were collected by centrifugation at 200g for 5 min and incubated in 0.4% KCl swelling solution for 10 min. Swollen cells were pre-fixed by the addition of freshly prepared methanol:acetic acid (3:1) fixative solution (~100 μl per 10 ml total volume). Pre-fixed cells were collected by centrifugation at 200g for 5 min and fixed in methanol:acetic acid (3:1) fixative solution. Spreads were dropped on a glass slide and incubated on a heating block at 65 °C overnight. Before hybridization, slides were treated with 1 mg ml⁻¹ RNase A (Qiagen, 19101) in 2× SSC for at least 45 min at 37 °C and then dehydrated in a 70%, 80% and 100% ethanol series for 2 min. Denaturation of spreads was performed in 70% formamide/2× SSC solution at 72 °C for 1.5 min and was immediately stopped by immersing the slides into an ethanol series pre-chilled to −20 °C.

Fluorescent probes for spectral karyotyping were generated in-house. Individual fluorescently labelled whole-chromosome paints were obtained from Applied Spectral Imaging. Paints were provided in a hybridization buffer and mixed 1:1 for indicated combinations. Labelled chromosome probes and paints were denatured by heating to 80 °C for 10 min before applying them to denatured slides. Spreads were hybridized to probes under a HybriSlip hybridization cover (Grace Bio-Labs, 716024) sealed with Cytobond (SciGene, 2020-00-1) in a humidified chamber at 37 °C for 48 h. After hybridization, the slides were washed three times in 50% formamide/2× SSC for 5 min at 45 °C, 1× SSC solution at 45 °C for 5 min twice, and at room temperature once. The slides were then rinsed with double-deionized H₂O, air-dried and mounted in Vectashield-containing DAPI (Vector Laboratories, H-1200-10).

For spectral karyotyping, images were acquired using an LSM710 confocal microscope (Zeiss) with the 63×/1.40 NA oil-immersion objective and ZEN (v.3.7) software. Segmentation, spectral unmixing and identification of chromosomes were performed using an open-source karyotype identification via spectral separation (KISS) analysis package for Fiji⁵⁴ (v.2.13.1), freely available online (http://research.stowers.org/imagejplugins/KISS_analysis.html). A detailed description of chromosome paints, hybridization and analysis procedures was reported previously⁵⁵.

For individually painted chromosomes, z stack images were acquired on the Nikon Ti-E microscope equipped with a 100× objective NA 1.45, Yokogawa CSU-W1 spinning disk and Flash 4.0 sCMOS camera with NIS-Elements AR (v.3.2) software. Image processing was performed in Fiji⁵⁴ (v.2.13.1).

Strand-seq analysis

To assess the karyotype of the CHM1 genome, we prepared strand-seq libraries from CHM1 cells using a previously published protocol^56,57. We sequenced the mono- and dinucleosome fractions separately, with the mononucleosomes sequenced with 75 bp, paired-end Illumina sequencing, and the dinucleosomes sequenced with 150 bp, paired-end Illumina sequencing. We demultiplexed the raw sequencing data based on library-specific barcodes and converted them to FASTQ files using Illumina standard software. We aligned the reads in the FASTQ files to the T2T-CHM13 reference genome⁴ (v.2.0) using BWA⁵⁸ (v.0.7.17-r1188), sorted the alignments using SAMtools⁵⁹ (v.1.9) and marked duplicate reads using sambamba⁶⁰ (v.1.0). We merged the BAM files for the mono- and dinucleosome fractions of each cell using SAMtools⁵⁹ (v.1.9). We used breakpointR (v.1.18)⁶¹ to assess the quality of generated strand-seq libraries with the following parameters: windowsize = 2000000, binMethod = ‘size’, pairedEndReads = TRUE, min.mapq = 10, background = 0.1, minReads = 50. We filtered the libraries based on the read density, level of background reads and level of genome coverage variability⁶². In total, 48 BAM files were selected for all subsequent analysis and are publicly available. We detected changes in strand-state inheritance across all strand-seq libraries using the R package AneuFinder⁶³ with the following parameters: variable.width.reference = <merged BAM of all 48 strand-seq libraries>, binsizes = windowsize, use.bamsignals = FALSE, pairedEndReads = TRUE, remove.duplicate.reads = TRUE, min.mapq = 10, method = ‘edivisive’, strandseq = TRUE, cluster.plots = TRUE, refine.breakpoints = TRUE. We extracted a list of recurrent strand-state changes reported as sister chromatid exchange hotspots by AneuFinder. 
With this analysis, we identified reciprocal translocations between chromosomes 4q35.1/11q24.3 and 16q23.3/17q25.3 (see below) and established the overall copy number for each chromosome and strand-seq library.

To identify the reciprocal translocation breakpoints between chromosomes 4q35.1/11q24.3 and 16q23.3/17q25.3 in the CHM1 genome, we first aligned CHM1 PacBio HiFi reads to the T2T-CHM13 reference genome⁴ (v.2.0) using pbmm2 (v.1.1.0) and used BEDtools⁶⁴ intersect (v.2.29.0) to define putative translocation regions based on AneuFinder analysis (described above). We extracted PacBio HiFi reads with supplementary alignments using SAMtools⁵⁹ (v.1.9) flag 2048. Using this method, we were able to identify the precise breakpoint of each translocation. Note that, for the reciprocal translocation between chromosomes 4q35.1/11q24.3, we report two breakpoints in each chromosome due to the presence of a ~97–98 kb deletion in the translocated homologues (Supplementary Fig. 3). The breakpoints are located at chromosome 4: 187112496/chromosome 11: 130542388, chromosome 4: 187209555/chromosome 11: 130444240, and chromosome 16: 88757545/chromosome 17: 81572367 (in T2T-CHM13 v.2.0).

Sequence identity across centromeric regions

To calculate the sequence identity across the centromeric regions from CHM1, CHM13 and 56 other diverse human genomes (generated by the HPRC¹⁰ and HGSVC²³), we performed three analyses that take advantage of different alignment methods. In the first analysis, we performed a pairwise sequence alignment between contigs from the CHM1, CHM13 and diverse genomes using minimap2⁵⁰ (v.2.24) and the following command: minimap2 -I 15G -K 8G -t {threads} -ax asm20 --secondary=no --eqx -s 2500 {ref.fasta} {query.fasta}. We chose these minimap2 parameters after testing several options and identifying optimal ones for alignment between repetitive and/or structurally divergent regions in diploid human genomes. Specifically, we chose -I 15G to provide additional memory for aligning between centromeric regions (the default is 4G and sometimes throws an error because of the large number of potential alignments). We also chose -K 8G because it allows for 8 Gb of sequence to be loaded into memory at a time. This is enough for a typical human diploid genome (~6 Gb) to be loaded. If we had left it at the default (500M), only a subset of contigs would be loaded at a time, and once the shortest contigs align, we would be left with only one thread aligning the longest contig. We therefore chose to increase this parameter so that the whole assembly is aligned at one time. We also chose to use -ax asm20 as it allows for sequences that are up to 20% divergent to be aligned. This is more permissive to alternative α-satellite HOR structures and sequence compositions than the other alignment options (for example, asm5 and asm10). We also opted to use --secondary=no to prevent secondary alignments from the same contig, thereby preventing multi-mapping and ensuring that the query would only align once to the reference. We added --eqx to allow us to parse the CIGAR string and calculate the mean sequence identity of the alignments. 
Finally, we selected -s 2500 as the minimal peak dynamic programming alignment score. The default setting for this parameter is 40, and we tested that one as well as 1000, 2500 and 5000. We found that with -s 40 and -s 1000, spurious alignments occurred from other centromeres, and with -s 5000, accurate alignments from centromeres were filtered out. We therefore chose -s 2500 to allow for diverse α-satellite HOR structures to align without some alignments being filtered out. After generating the alignments, we filtered them using SAMtools⁵⁹ (v.1.9) flag 4, which keeps primary and partial alignments. We subsequently partitioned the alignments into 10 kb non-overlapping windows in the reference genome (either CHM1 or CHM13) and calculated the mean sequence identity between the pairwise alignments in each window with the following formula: (number of matches)/(number of matches + number of mismatches + number of insertion events + number of deletion events). We then averaged the sequence identity across the 10 kb windows within the α-satellite HOR array(s), monomeric/diverged α-satellites, other satellites and non-satellites for each chromosome to determine the mean sequence identity in each region.
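
The identity formula above counts each insertion or deletion as a single event regardless of its length. A minimal sketch of that calculation for one alignment is shown below; it assumes an extended CIGAR string of the kind produced with --eqx (using = and X rather than M), and the function name and example string are hypothetical. Binning alignments into 10 kb reference windows is left out for brevity.

```python
import re

# One CIGAR operation: a length followed by a single operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def identity_by_event(cigar):
    """matches / (matches + mismatches + insertion events + deletion events)."""
    matches = mismatches = insertion_events = deletion_events = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op == "=":
            matches += int(length)       # every matching base counts
        elif op == "X":
            mismatches += int(length)    # every mismatching base counts
        elif op == "I":
            insertion_events += 1        # one event, whatever its length
        elif op == "D":
            deletion_events += 1         # one event, whatever its length
        elif op == "M":
            raise ValueError("ambiguous 'M' operation; an extended CIGAR is required")
        # Clipping (S, H), skips (N) and padding (P) do not enter the formula.
    denominator = matches + mismatches + insertion_events + deletion_events
    if denominator == 0:
        raise ValueError("CIGAR string contains no aligned positions")
    return matches / denominator


# Hypothetical alignment: 98 matches, 2 mismatches,
# one 3 bp insertion and one 1 bp deletion.
example_identity = identity_by_event("5S90=2X3I5=1D3=")
print(f"identity = {example_identity:.4f}")
```

Counting indels per event rather than per base keeps a single long insertion or deletion from dominating the identity of a window.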

In the second analysis, we first fragmented the centromeric contigs from each genome assembly into 10 kb fragments with seqtk (v.1.3; https://github.com/lh3/seqtk) and subsequently aligned them to the reference genome (either CHM1 or CHM13) using minimap2⁵⁰ (v.2.24) and the following command: minimap2 -I 15G -K 8G -t {threads} -ax asm20 --secondary=no --eqx -s 40 {ref.fasta} {query.fasta}. We filtered the alignments using SAMtools⁵⁹ (v.1.9) flag 4, which keeps primary and partial alignments. In this method, multiple 10 kb fragments are allowed to align to the same region in the reference genome, but each 10 kb fragment is only allowed to align once. We then partitioned the alignments into 10 kb non-overlapping windows in the reference genome and calculated the mean sequence identity between all alignments in each window as described above. We averaged the sequence identity across the 10 kb windows within the α-satellite HOR array(s), monomeric/diverged α-satellites, other satellites and non-satellites for each chromosome to determine the mean sequence identity in each region.

In the third analysis, we first identified the location of the α-satellite HOR array(s) in each genome assembly using RepeatMasker⁶⁵ (v.4.1.0) followed by HumAS-HMMER (https://github.com/fedorrik/HumAS-HMMER_for_AnVIL) and subsequently extracted regions enriched with ‘live’ α-satellite HORs (denoted with an ‘L’ in the HumAS-HMMER BED file). We then ran TandemAligner⁶⁶ (v.0.1) on pairs of complete centromeric HOR arrays using the following command: tandem_aligner --first {ref.fasta} --second {query.fasta} -o {output_directory}. We parsed the CIGAR string generated by TandemAligner by first binning the alignments into 10 kb non-overlapping windows and calculating the mean sequence identity in each window as described above. As TandemAligner is only optimized for tandem repeat arrays, we assessed the sequence identity only in the α-satellite HOR array(s) of each centromeric region and did not use it to assess the sequence identity in any other region.

Better-match analysis

To determine whether the CHM1 or CHM13 centromeres are a better match to those from the 56 diverse human genomes assembled by the HPRC¹⁰ and HGSVC²³, we performed a pairwise sequence alignment between contigs from the HPRC and HGSVC assemblies to either the CHM1 or CHM13 assembly using minimap2⁵⁰ (v.2.24) and the following command: minimap2 -I 15G -K 8G -t {threads} -ax asm20 --secondary=no --eqx -s 2500 {ref.fasta} {query.fasta}. We filtered the alignments using SAMtools⁵⁹ (v.1.9) flag 4, which keeps primary and partial alignments, and then calculated an alignment score between each pair of haplotypes, limiting our analysis to only the centromeric α-satellite HOR arrays as follows: (total number of aligned bases in the query)/(total number of bases in the reference) × (mean sequence identity by event). The mean sequence identity by event is calculated as follows: (number of matches)/(number of matches + number of mismatches + number of insertion events + number of deletion events). The set of centromeres with a higher alignment score was determined to be a better match to that haplotype than the other set of centromeres.
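
The comparison can be illustrated with a short sketch that multiplies the aligned fraction by the event-based identity and picks the reference with the higher score. The function names and all numbers below are hypothetical and serve only to show how the two terms combine; how ties would be handled is not stated in the text and is flagged as such in the code.

```python
def alignment_score(aligned_query_bases, reference_bases, identity_by_event):
    """(aligned query bases / reference bases) x mean sequence identity by event."""
    if reference_bases <= 0:
        raise ValueError("reference_bases must be positive")
    return aligned_query_bases / reference_bases * identity_by_event


def better_match(score_chm1, score_chm13):
    """Name the reference with the higher score (tie handling is an assumption)."""
    if score_chm1 > score_chm13:
        return "CHM1"
    if score_chm13 > score_chm1:
        return "CHM13"
    return "tie"


# Hypothetical haplotype aligned to a 3 Mb HOR array in each reference:
# more identical but less complete against CHM1, the reverse against CHM13.
score_vs_chm1 = alignment_score(2_400_000, 3_000_000, 0.98)
score_vs_chm13 = alignment_score(2_700_000, 3_000_000, 0.95)
winner = better_match(score_vs_chm1, score_vs_chm13)
print(f"CHM1: {score_vs_chm1:.3f}, CHM13: {score_vs_chm13:.3f}, better match: {winner}")
```

Weighting identity by the aligned fraction means a short but near-perfect alignment does not outrank a longer alignment that covers most of the array.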

Pairwise sequence identity heat maps

To generate pairwise sequence identity heat maps of each centromeric region, we ran StainedGlass⁴⁴ (v.6.7.0) with the following parameters: window=5000 mm_f=30000 mm_s=1000. We normalized the colour scale across the StainedGlass plots by binning the percentage of sequence identities equally and recolouring the data points according to the binning. To generate heat maps that show only the variation between centromeric regions, we ran StainedGlass⁴⁴ (v.6.7.0) with the following parameters: window=5000 mm_f=60000 mm_s=30000. As above, we normalized the colour scale across the StainedGlass plots by binning the percentage of sequence identities equally and recolouring the data points according to the binning.

Estimation of α-satellite HOR array length

To estimate the length of the α-satellite HOR arrays of each centromere in the CHM1, CHM13 and 56 diverse genome assemblies^10,23, we first ran RepeatMasker⁶⁵ (v.4.1.0) on the assemblies and identified contigs containing α-satellite repeats, marked by ‘ALR/Alpha’. We extracted these α-satellite-containing contigs and ran HumAS-HMMER (https://github.com/fedorrik/HumAS-HMMER_for_AnVIL) on each of them. HumAS-HMMER is a tool that identifies the location of α-satellite HORs in human centromeric sequences. It uses a hidden Markov model (HMM) profile for centromeric α-satellite HOR monomers and generates a BED file with the coordinates of the α-satellite HORs and their classification. Using this BED file, we extracted contigs containing α-satellite HORs that were designated as live or active (denoted with an ‘L’ in the HumAS-HMMER BED file), which are those that belong to an array that consistently associates with the kinetochore in several individuals^5,67. By contrast, dead or inactive α-satellite HORs (denoted with a ‘d’ in the HumAS-HMMER BED file) are those that have not been found to be associated with the kinetochore and are usually more divergent in sequence than the live or active arrays. We filtered out contigs that had incomplete α-satellite HOR arrays (such as those that did not traverse into unique sequence), thereby limiting our analysis to only complete α-satellite HOR arrays. Moreover, we assessed the integrity of each of the α-satellite HOR array-containing contigs using NucFreq²² to ensure that they were completely and accurately assembled, filtering out those with evidence of a deletion, duplication or misjoin. Finally, we calculated the length of the α-satellite HOR arrays in the remaining contigs by taking the minimum and maximum coordinate of the ‘live’ α-satellite HOR arrays and plotting their lengths with GraphPad Prism (v.9.5.1).

Sequence composition and organization of α-satellite HOR arrays

To determine the sequence composition and organization of each α-satellite HOR array in the CHM1, CHM13 and 56 diverse genome assemblies^10,23, we ran HumAS-HMMER (https://github.com/fedorrik/HumAS-HMMER_for_AnVIL) on centromeric contigs with the default parameters and parsed the resulting BED file with StV (https://github.com/fedorrik/stv). This generated a BED file with each α-satellite HOR sequence composition and its organization along the α-satellite HOR arrays. We used the stv_row.bed file to visualize the organization of the α-satellite HOR arrays with R⁶⁸ (v.1.1.383) and the ggplot2 package⁶⁶. The α-satellite monomer and HOR classification generated with HumAS-HMMER is described in detail in the supplementary information of a previous study⁵.

CpG methylation analysis

To determine the CpG methylation status of each CHM1 centromere, we aligned CHM1 ONT reads >30 kb in length to the CHM1 whole-genome assembly using Winnowmap⁵¹ (v.1.0) and then assessed the CpG methylation status of the centromeric regions with Nanopolish⁶⁹ (v.0.13.3). Nanopolish distinguishes 5-methylcytosines from unmethylated cytosines via a HMM on the raw nanopore current signal. The methylation caller generates a log-likelihood value for the ratio of probability of methylated to unmethylated CpGs at a specific k-mer. We filtered methylation calls using the nanopore_methylation_utilities tool⁷⁰ (https://github.com/isaclee/nanopore-methylation-utilities), which uses a log-likelihood ratio of 2.5 as a threshold for calling methylation. CpG sites with log-likelihood ratios greater than 2.5 (methylated) or less than −2.5 (unmethylated) are considered to be high quality and are included in the analysis. Reads that do not have any high-quality CpG sites are filtered from the BAM for subsequent methylation analysis. Nanopore_methylation_utilities integrates methylation information into the BAM file for viewing in IGV’s⁵² bisulfite mode, which was used to visualize CpG methylation. To determine the size of the hypomethylated region (termed the CDR³¹) in each centromere, we developed a novel tool, CDR-Finder (https://github.com/arozanski97/CDR-Finder). This tool first bins the assembly into 5 kb windows, computes the median CpG methylation frequency within windows containing α-satellite (as determined by RepeatMasker⁶⁵ (v.4.1.0)), selects bins that have a lower CpG methylation frequency than the median frequency in the region, merges consecutive bins into a larger bin, filters for merged bins that are >50 kb and reports the location of these bins.
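
The binning logic described for CDR-Finder can be sketched as follows. This is an illustrative reimplementation of the steps as stated in the text, not the CDR-Finder code itself; the function name, the input format (one methylation frequency per 5 kb α-satellite window, in genomic order) and the example values are assumptions.

```python
from statistics import median


def find_hypomethylated_regions(bin_freqs, bin_size=5_000, min_length=50_000, region_start=0):
    """Return (start, end) intervals of merged bins that fall below the median.

    bin_freqs: CpG methylation frequency of consecutive windows.
    Only merged runs longer than min_length are reported.
    """
    threshold = median(bin_freqs)
    regions = []
    run_start = None
    # A trailing sentinel equal to the threshold closes any run that
    # reaches the end of the region.
    for index, freq in enumerate(list(bin_freqs) + [threshold]):
        is_low = freq < threshold
        if is_low and run_start is None:
            run_start = index
        elif not is_low and run_start is not None:
            start = region_start + run_start * bin_size
            end = region_start + index * bin_size
            if end - start > min_length:
                regions.append((start, end))
            run_start = None
    return regions


# Hypothetical 200 kb region: mostly methylated (0.8), with one 60 kb dip
# (bins 10 to 21) and one 15 kb dip (bins 30 to 32).
frequencies = [0.8] * 40
for i in range(10, 22):
    frequencies[i] = 0.2
for i in range(30, 33):
    frequencies[i] = 0.3

cdr_regions = find_hypomethylated_regions(frequencies)
print(cdr_regions)
```

In this toy input only the 60 kb dip passes the length filter; the 15 kb dip is below the median but too short to be reported.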

Native CENP-A ChIP–seq and analysis

To determine the location of centromeric chromatin within the CHM1 genome, we performed two independent replicates of native CENP-A chromatin immunoprecipitation–sequencing (ChIP–seq) analysis of CHM1 cells as described previously²¹, with some modifications. In brief, 3–4 × 10⁷ cells were collected and resuspended in 2 ml of ice-cold buffer I (0.32 M sucrose, 15 mM Tris, pH 7.5, 15 mM NaCl, 5 mM MgCl₂, 0.1 mM EGTA and 2× Halt Protease Inhibitor Cocktail (Thermo Fisher Scientific, 78429)). Then, 2 ml of ice-cold buffer II (0.32 M sucrose, 15 mM Tris, pH 7.5, 15 mM NaCl, 5 mM MgCl₂, 0.1 mM EGTA, 0.1% IGEPAL and 2× Halt Protease Inhibitor Cocktail) was added, and the samples were placed onto ice for 10 min. The resulting 4 ml of nuclei was gently layered on top of 8 ml of ice-cold buffer III (1.2 M sucrose, 60 mM KCl, 15 mM Tris, pH 7.5, 15 mM NaCl, 5 mM MgCl₂, 0.1 mM EGTA and 2× Halt Protease Inhibitor Cocktail (Thermo Fisher Scientific, 78429)) and centrifuged at 10,000g for 20 min at 4 °C. Pelleted nuclei were resuspended in buffer A (0.34 M sucrose, 15 mM HEPES, pH 7.4, 15 mM NaCl, 60 mM KCl, 4 mM MgCl₂ and 2× Halt Protease Inhibitor Cocktail) to 400 ng ml⁻¹. Nuclei were frozen on dry ice and stored at −80 °C. MNase digestion reactions were performed on 200–300 μg chromatin, using 0.2–0.3 U μg⁻¹ MNase (Thermo Fisher Scientific, 88216) in buffer A supplemented with 3 mM CaCl₂ for 10 min at 37 °C. The reaction was quenched with 10 mM EGTA on ice and centrifuged at 500g for 7 min at 4 °C. The chromatin was resuspended in 10 mM EDTA and rotated at 4 °C for 2 h. The mixture was adjusted to 500 mM NaCl, rotated for another 45 min at 4 °C and then centrifuged at maximum speed (21,100g) for 5 min at 4 °C, yielding digested chromatin in the supernatant. 
Chromatin was diluted to 100 ng ml⁻¹ with buffer B (20 mM Tris, pH 8.0, 5 mM EDTA, 500 mM NaCl and 0.2% Tween-20) and precleared with 100 μl 50% protein G Sepharose bead (Abcam, ab193259) slurry for 20 min at 4 °C with rotation. Precleared supernatant (10–20 μg bulk nucleosomes) was saved for further processing. To the remaining supernatant, 20 μg mouse monoclonal anti-human CENP-A antibody (3-19; Enzo, ADI-KAM-CC006-E; approximately a 1:80 dilution) was added and rotated overnight at 4 °C. Immunocomplexes were recovered by the addition of 200 μl 50% protein G Sepharose bead slurry followed by rotation at 4 °C for 3 h. The beads were washed three times with buffer B and once with buffer B without Tween-20. For the input fraction, an equal volume of input recovery buffer (0.6 M NaCl, 20 mM EDTA, 20 mM Tris, pH 7.5 and 1% SDS) and 1 μl of RNase A (10 mg ml⁻¹) was added, followed by incubation for 1 h at 37 °C. Proteinase K (100 μg ml⁻¹, Roche) was then added, and the samples were incubated for another 3 h at 37 °C. For the ChIP fraction, 300 μl of ChIP recovery buffer (20 mM Tris, pH 7.5, 20 mM EDTA, 0.5% SDS and 500 μg ml⁻¹ proteinase K) was added directly to the beads and incubated for 3–4 h at 56 °C. The resulting proteinase-K-treated samples were subjected to a phenol–chloroform extraction followed by purification using the Qiagen MinElute PCR purification column. Unamplified bulk nucleosomal and ChIP DNA was analysed using an Agilent Bioanalyzer instrument and a 2100 High Sensitivity Kit.

Sequencing libraries were generated using the TruSeq ChIP Library Preparation Kit, Set A (Illumina, IP-202-1012) according to the manufacturer’s instructions, with some modifications. In brief, 5–10 ng bulk nucleosomal or ChIP DNA was end-repaired and A-tailed. Illumina TruSeq adaptors were ligated, libraries were size-selected to exclude polynucleosomes using an E-Gel SizeSelect II agarose gel and the libraries were PCR-amplified using the PCR polymerase and primer cocktail provided in the kit. The resulting libraries were submitted for 150 bp, paired-end Illumina sequencing using the NextSeq 500/550 High Output Kit v2.5 (300 cycles). The resulting reads were assessed for quality using FastQC (https://github.com/s-andrews/FastQC), trimmed with Sickle (v.1.33; https://github.com/najoshi/sickle) to remove low-quality 5′- and 3′-end bases, and trimmed using Cutadapt⁷¹ (v.1.18) to remove adapters.

Processed CENP-A ChIP and bulk nucleosomal reads were aligned to the CHM1 whole-genome assembly using BWA-MEM⁷² (v.0.7.17) with the following parameters: bwa mem -k 50 -c 1000000 {index} {read1.fastq.gz} {read2.fastq.gz}. The resulting SAM files were filtered using SAMtools⁵⁹ (v.1.9) with flag score 2308 to prevent multi-mapping of reads. With this filter, reads mapping to more than one location are randomly assigned a single mapping location, thereby preventing mapping biases in highly identical regions. Alignments were normalized and filtered with deepTools⁷³ (v.3.4.3) bamCompare with the following parameters: bamCompare -b1 {ChIP.bam} -b2 {bulk_nucleosomal.bam} --operation ratio --binSize 1000 --minMappingQuality 1 -o {out.bw}. Alternatively, CENP-A ChIP–seq data alignments were filtered using a marker-assisted mapping strategy as described previously⁵. In brief, unique 51-mers in the CHM1 whole-genome assembly were counted and filtered with meryl⁵³ (v.1.3). The locations of the unique 51-mers were identified with meryl⁵³ (v.1.3) and then used to filter the CENP-A ChIP–seq and input alignments using BEDtools⁶⁴ intersect (v.2.29.0). Alignments were normalized and filtered with deepTools⁷³ (v.3.4.3) bamCompare with the following parameters: bamCompare -b1 {ChIP.bam} -b2 {bulk_nucleosomal.bam} --operation ratio --binSize 1000 -o {out.bw}.
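
SAM flag values such as 2308 are bitwise sums of individual flags, and decoding them shows which classes of records a filter targets. The sketch below decodes a flag using the bit definitions from the SAM specification; treating 2308 as an exclusion mask (as with samtools view -F) is an assumption about how the filter was applied, and the helper names are hypothetical.

```python
# Bit definitions of the SAM FLAG field.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the names of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


def passes_exclusion_filter(read_flag, exclude_mask):
    """True if the read has none of the bits in exclude_mask set.

    Assumes the mask is used to exclude records (samtools view -F style).
    """
    return (read_flag & exclude_mask) == 0


# 2308 = 4 + 256 + 2048
print(decode_flag(2308))
# A properly paired, mapped first-in-pair read (flag 99) is retained,
# whereas a supplementary alignment on the reverse strand (flag 2064) is not.
print(passes_exclusion_filter(99, 2308), passes_exclusion_filter(2064, 2308))
```

Under that reading, the mask removes unmapped reads together with secondary and supplementary records, leaving one primary alignment per read.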

Estimation of the length of the kinetochore sites

To estimate the length of the CHM1 and CHM13 kinetochore sites, we first determined the CpG methylation status of each CHM1 and CHM13 centromere using the approach described above (see the ‘CpG methylation analysis’ section). We then mapped the CENP-A ChIP–seq data from each genome to the same source genome using the mapping parameters described above (see the ‘Native CENP-A ChIP–seq and analysis’ section). We next used CDR-Finder (https://github.com/arozanski97/CDR-Finder) to identify the location of hypomethylated regions within the centromeres, and we filtered the hypomethylated regions that had less than tenfold enrichment of CENP-A ChIP–seq reads relative to the bulk nucleosomal reads. We reported the lengths of the hypomethylated regions enriched with CENP-A as determined with CDR-Finder, and we tested for statistical significance using a two-sided Kolmogorov–Smirnov test with GraphPad Prism (v.9.5.1).

Immuno-FISH on stretched metaphase chromosome spreads

Mechanically stretched metaphase spreads were obtained from the CHM1 cell line according to established procedures⁷⁴. In brief, colcemid-treated cells were washed in phosphate-buffered saline (1× PBS), counted, and resuspended for 15 min in a hypotonic buffer HCM (10 mM HEPES, pH 7.3, 1 mM glycerol, 1 mM CaCl₂ and 0.8 mM MgCl₂) to achieve a final concentration of 10,000 cells per ml. Then, 0.5 ml of the cell suspension was cytocentrifuged onto glass slides at 2,000 rpm for 8 min with a Shandon Cytospin 3 and fixed in methanol at −20 °C for 15 min and in methanol:acetic acid 3:1 at −20 °C for 30 min. The slides were aged overnight at room temperature.

Immunofluorescence was performed on the stretched metaphase chromosome spreads using an in-house rabbit polyclonal CENP-C antibody as previously described with minor modifications⁷⁵. In brief, each slide was rehydrated by immersion in 1× PBS-azide (10 mM NaPO₄, pH 7.4, 0.15 M NaCl, 1 mM EGTA and 0.01% NaN₃) for 15 min at room temperature. Chromosomes were then swollen by washing the slides (three times, 2 min each) with 1× TEEN (1 mM triethanolamine-HCl, pH 8.5, 0.2 mM NaEDTA, and 25 mM NaCl), 0.5% Triton X-100 and 0.1% BSA. The primary polyclonal antibody against the centromeric protein CENP-C was diluted 1:40 in the same solution and then added (100 μl) onto the slides. Each slide was incubated for 2 h at 37 °C. Excess of primary antibody was removed by washing the slides at room temperature (three times, 2, 5 and 3 min each) with 1× KB buffer (10 mM Tris-HCl, pH 7.7, 0.15 M NaCl and 0.1% BSA). A goat anti-rabbit IgG secondary antibody conjugated to FITC (Sigma-Aldrich, F0382) was diluted 1:40 in the same solution, and 100 μl was then added to the slides that were then incubated for 45 min at 37 °C in a dark chamber. After incubation with the secondary antibody, the slides were washed once with 1× KB for 2 min, prefixed with 4% paraformaldehyde in 1× KB for 45 min at room temperature, washed with distilled H₂O by immersion for 10 min at room temperature, and fixed with methanol and acetic acid (3:1) for 15 min. FISH was then performed using two α-satellite-containing plasmids (pZ21A and pGA16) directly labelled by nick-translation with Cy3-dUTP (Enzo, 42501) according to a standard procedure with minor modifications⁷⁶. In brief, 300 ng of labelled probe was used for the FISH experiments; DNA denaturation was performed at 70 °C for 4 min and hybridization at 37 °C in 2× SSC, 50% (v/v) formamide, 10% (w/v) dextran sulphate, 3 μg Cot-1 DNA and 3 mg sonicated salmon sperm DNA, in a volume of 10 μl. 
Post-hybridization washing was performed under high stringency conditions: at 60 °C in 0.1× SSC (three times, 5 min each). Nuclei and chromosome metaphases were simultaneously DAPI-stained. Digital images were obtained using a Leica DMRXA2 epifluorescence microscope equipped with a cooled CCD camera (Princeton Instruments). DAPI, Cy3 and fluorescein fluorescence signals, detected with specific filters, were recorded separately as grayscale images. Pseudocolouring and merging of images were performed using ImageJ (v.1.53k).

Human and NHP α-satellite SF classification and strand orientation analysis

Human and NHP α-satellite monomers are grouped into 20 distinct SF classes based on shared sequence identity and structure, as described in detail previously⁵. The SF classes and their monomers are as follows: SF1 (J1 and J2), SF01 (J3, J4, J5 and J6), SF2 (D1, D2 and FD), SF02 (D3, D4, D5, D6, D7, D8 and D9), SF3 (W1, W2, W3, W4 and W5), SF4 (Ga), SF5 (R1 and R2), SF6 (Ha), SF7 (Ka), SF8 (Oa and Na), SF9 (Ca), SF10 (Ba), SF11 (Ja), SF12 (Aa), SF13 (Ia), SF14 (La), SF15 (Fa), SF16 (Ea), SF17 (Qa) and SF18 (Pa and Ta). To determine the α-satellite SF content and strand orientation of human and NHP centromeres, we ran HumAS-HMMER (https://github.com/fedorrik/HumAS-HMMER_for_AnVIL) on centromeric contigs with the following command: hmmer-run_SF.sh {path_to_directory_with_fasta} AS-SFs-hmmer3.0.290621.hmm {number_of_threads}. This generated a BED file with the SF classification and strand orientation of each α-satellite monomer, which we visualized with R⁶⁸ (v.1.1.383) using the ggplot2 package⁶⁶. In cases in which an inversion was detected, we ran StringDecomposer⁷⁷, a tool that detects and reports changes in orientation of tandem repeats, using the default parameters to confirm the presence of reoriented α-satellite monomers at the breakpoints. Finally, we validated the presence of the inversion by aligning native ultra-long ONT reads to the assemblies as described above and confirming even coverage across the breakpoints as well as the presence of inverted α-satellite monomers in the aligned reads.

We uploaded the α-satellite SF and strand orientation tracks generated by HumAS-HMMER for each centromere assembly to the UCSC Human Genome Browser. For the CHM1 centromeres, we uploaded two additional tracks: one showing each α-satellite monomer belonging to known human HORs (ASat-HOR track) and another showing structural variation in human HORs (StV track). All tracks were built and colour-coded as described previously⁵ and are publicly available online (https://genome.ucsc.edu/s/fedorrik/chm1_cen (CHM1); https://genome.ucsc.edu/s/fedorrik/T2T_dev (CHM13); https://genome.ucsc.edu/s/fedorrik/cen_primates (chimpanzee, orangutan, and macaque)). Note that the SF annotation coverage in macaque is sometimes discontinuous (some monomers are not annotated due to significant divergence of macaque dimers from their progenitor Ka class monomers). However, most monomers are identified as Ka, which indicates SF7. In orangutan centromeres, most monomers are identified as R1 and R2, which indicates SF5. In chimpanzee and human autosome and X chromosome centromeres, active arrays are formed by J1 and J2 (SF1), D1, FD and D2 (SF2), and W1–W5 (SF3) monomers. The only exception uncovered in this paper is the centromere of chimpanzee chromosome 5, which appears to be formed by R1 and R2 (SF5), with some monomers identified as J4 and Ga. J4 belongs to SF01, which represents the generation of α-satellite intermediate between the progenitor SF5 and the more derived SF1, and J4 is particularly close to the R1 monomer. Moreover, the other SF01 monomers, such as J3, J5 and J6, are absent from the array, which indicates that it is not genuine SF01. Thus, the J4 monomer in chimpanzee centromere 5 should be considered a variant of R1. Similarly, occasional Ga monomers belong to SF4, which is the direct progenitor of SF5, and Ga is very close to R2. Ga monomers dispersed in the SF5 array are therefore just misclassified R2 monomers. 
The whole chimpanzee chromosome 5 α-satellite HOR array should therefore be classified as SF5, despite the abovementioned contaminations.

Human and NHP phylogenetic analysis

Humans, chimpanzees, orangutans and macaques diverged over a period of at least 25 million years, with chimpanzees diverging approximately 6 million years ago²⁹, orangutans 12–16 million years ago²⁹ and macaques ~25 million years ago⁷⁸. Despite these divergence times, all primates retain α-satellite repeats, which permit the phylogenetic analysis of these regions and an estimation of their evolutionary trajectory. To assess the phylogenetic relationship between α-satellite repeats in human and NHP genomes, we first masked every non-α-satellite repeat in the CHM1, CHM13, HG00733, chimpanzee, orangutan and macaque centromere assemblies using RepeatMasker⁶⁵ (v.4.1.0). We then subjected the masked assemblies to StringDecomposer⁷⁷ using α-satellite monomers derived from the T2T-CHM13 reference genome⁴ (v.2.0). This tool identifies the location of α-satellite monomers in the assemblies, and we used this to extract the α-satellite monomers from the HOR/dimeric array and monomeric regions into multi-FASTA files. We randomly selected 100 and 50 α-satellite monomers from the HOR/dimeric array and monomeric regions, respectively, and aligned them with MAFFT⁷⁹,⁸⁰ (v.7.453). We used IQ-TREE⁸¹ (v.2.1.2) to reconstruct the maximum-likelihood phylogeny with model selection and 1,000 bootstraps. The resulting tree file was visualized in iTOL⁸².

To estimate sequence divergence along the pericentromeric regions, we first mapped each NHP centromere assembly to the CHM13 centromere assembly using minimap2⁵⁰ (v.2.17-r941) with the following parameters: -ax asm20 --eqx -Y -t 8 -r 500000. We then generated a BED file of 10 kb windows located within the CHM13 centromere assembly. We used the BED file to subset the BAM file, which was subsequently converted into a set of FASTA files. FASTA files contained at least 5 kb of sequence from one or more NHP centromere assemblies mapping to orthologous chromosomes. Pairs of human and NHP sequences were realigned using MAFFT⁷⁹,⁸⁰ (v.7.453) with the following command: mafft --maxiterate 1000 --localpair. Next, we calculated the SNV density and Ti/Tv ratios from these alignments, limiting our analysis to only those regions with one-to-one unambiguous mapping and excluding segmental duplications and satellite repeats (Supplementary Table 10). As a control, we also calculated the SNV density and Ti/Tv ratios from 500 uniquely mapping regions across the genomes (Supplementary Table 11). We estimated the sequence divergence using the Tamura-Nei substitution model⁸³, which accounts for recurrent mutations and differences between transversions and transitions as well as within transitions. The mutation rate per segment was estimated using Kimura’s model of neutral evolution⁸⁴. In brief, we modelled the estimated divergence (D) as a result of between-species substitutions and within-species polymorphisms, that is:
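As an illustration of the SNV density and Ti/Tv calculation described above, the sketch below counts transitions and transversions in one pairwise alignment. It is a simplified stand-in rather than the published code: the function name snv_stats is hypothetical, columns containing gaps or ambiguous bases are assumed to be skipped, and no substitution-model correction is applied.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def snv_stats(seq_a: str, seq_b: str) -> dict:
    """Count SNVs, transitions and transversions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Skip alignment columns with gaps or ambiguous bases (assumption).
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    snvs = transitions + transversions
    return {
        "compared": compared,
        "snvs": snvs,
        "snv_density": snvs / compared if compared else float("nan"),
        "ti_tv": transitions / transversions if transversions else float("nan"),
    }


print(snv_stats("ACGT-A", "GCTTAA"))
```

In this toy alignment, five columns are compared and two differ (one transition and one transversion), giving an SNV density of 0.4 and a Ti/Tv ratio of 1.0.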

$$D = 2\mu t + 4N_{\mathrm{e}}\mu$$

where N_e is the ancestral human effective population size, t is the divergence time for a given human–NHP pair and μ is the mutation rate. We assumed a generation time of [20, 29] years and the following divergence times: human–macaque = [23 × 10⁶, 25 × 10⁶] years, human–orangutan = [12 × 10⁶, 14 × 10⁶] years, human–chimpanzee = [4 × 10⁶, 6 × 10⁶] years. To convert the genetic unit to a physical unit, our computation also assumes N_e = 10,000 and uniformly drawn values for the generation and divergence times.
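Rearranging the relation above gives μ = D / (2t + 4N_e). The following sketch shows how μ could be computed under the stated ranges. It assumes that t is expressed in generations (divergence time in years divided by the generation time), so that μ is per site per generation; the divergence value 0.012 is a made-up input for illustration, and the function names are hypothetical.

```python
import random


def mutation_rate(divergence, t_years, generation_years, n_e=10_000):
    """Solve D = 2*mu*t + 4*N_e*mu for mu, with t converted to generations."""
    t_generations = t_years / generation_years
    return divergence / (2 * t_generations + 4 * n_e)


def sample_mutation_rates(divergence, t_range, g_range=(20, 29),
                          n_e=10_000, draws=1000, seed=0):
    """Draw divergence and generation times uniformly and return the mu values."""
    rng = random.Random(seed)
    return [
        mutation_rate(divergence, rng.uniform(*t_range), rng.uniform(*g_range), n_e)
        for _ in range(draws)
    ]


# Hypothetical human-chimpanzee divergence of 1.2% for one segment.
rates = sample_mutation_rates(0.012, t_range=(4e6, 6e6))
print(min(rates), max(rates))
```

Repeating the draw many times yields a distribution of μ that reflects the uncertainty in the generation and divergence times.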

Human-specific phylogenetic analysis

To determine the phylogenetic relationship and divergence times between centromeric regions from chromosomes 5, 7 and 10–14 in the CHM1, CHM13 and 56 other diverse human genomes (sequenced and assembled by the HPRC¹⁰ and HGSVC²³), we first identified contigs with complete and accurately assembled centromeric α-satellite HOR arrays, as determined by RepeatMasker⁶⁵ (v.4.1.0) and NucFreq²² analysis. We then aligned each of these contigs to the T2T-CHM13 reference genome⁴ (v.2.0) using minimap2⁵⁰ (v.2.24). We also aligned the chimpanzee whole-genome assembly to the T2T-CHM13 reference genome⁴ (v.2.0) to serve as an outgroup in our analysis. We identified 20 kb regions in the flanking monomeric α-satellite or unique regions on the p- or q-arms and ensured that the region we had selected had only a single alignment from each haplotype to the reference genome. We next aligned these regions to each other using MAFFT⁷⁹,⁸⁰ (v.7.453) with the following command: mafft --auto --thread {num_of_threads} {multi-fasta.fasta}. We used IQ-TREE⁸¹ (v.2.1.2) to reconstruct the maximum-likelihood phylogeny with model selection and 1,000 bootstraps. The resulting tree file was visualized in iTOL⁸². Timing estimates were calculated by applying a molecular clock based on the branch-length distance to individual nodes and assuming a divergence time between human and chimpanzee of 6 million years ago. Clusters of α-satellite HOR arrays with a single monophyletic origin were assessed for gains and losses of α-satellite base pairs, monomers, HORs and distinct structural changes manually.
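One simple way to read the timing estimate is as a strict-clock rescaling of branch lengths against the human–chimpanzee calibration. The sketch below shows that arithmetic only. It assumes a linear relation between branch-length distance and time, which the text implies but does not detail; the function name and the example distances are hypothetical.

```python
def node_age_years(dist_to_node, dist_to_calibration_node, calibration_years=6e6):
    """Scale a branch-length distance to years using one calibrated split.

    dist_to_node: branch-length distance from the tips to the node of interest.
    dist_to_calibration_node: distance from the tips to the human-chimpanzee split.
    """
    if dist_to_calibration_node <= 0:
        raise ValueError("calibration distance must be positive")
    return dist_to_node / dist_to_calibration_node * calibration_years


# A node at one-twelfth of the human-chimpanzee distance dates to about 0.5 Myr.
print(node_age_years(0.001, 0.012))
```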

Polymorphic TE analysis

To detect polymorphic TEs between the CHM1 and CHM13 centromeric regions, we first ran RepeatMasker⁶⁵ (v.4.1.0) on the CHM1 and CHM13 centromeric regions. We then masked all satellite repeats within these regions using BEDtools⁶⁴ maskfasta (v.2.29.0). We aligned the masked CHM1 fasta to the masked CHM13 fasta using minimap2⁵⁰ and the following command: minimap2 -t {threads} --eqx -c -x asm20 --secondary=no {ref.fasta} {query.fasta}. Using the resulting PAF, we extracted the regions with structural variants that were >50 bp long. We next intersected these regions with the RepeatMasker annotation file to identify those variants that overlapped SINE, LINE or LTR repeat classes by >75%. We considered the following LINE and SINE subgroups: LINE/CR1, LINE/L1, LINE/L1-Tx1, LINE/L2, LINE/Penelope, LINE/RTE-BovB, LINE/RTE-X, SINE/5S-Deu-L2, SINE/Alu, SINE/MIR, SINE/tRNA, SINE/tRNA-Deu, SINE/tRNA-RTE. We then determined the variation in length of these regions between the two centromeric regions, and we plotted their position and length using R⁶⁸ (v.1.1.383) and the ggplot2 package⁶⁶.
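The length and overlap filters described above can be sketched as follows. This is an illustration under two assumptions: the 75% threshold is taken to be the fraction of the variant covered by a single TE annotation, and intervals are half-open (start, end) coordinates on one sequence. The function names and the example coordinates are hypothetical.

```python
TE_CLASSES = ("SINE", "LINE", "LTR")


def overlap_fraction(var_start, var_end, te_start, te_end):
    """Fraction of the variant interval covered by the TE interval."""
    overlap = max(0, min(var_end, te_end) - max(var_start, te_start))
    return overlap / (var_end - var_start)


def te_variants(variants, annotations, min_len=50, min_frac=0.75):
    """Return variants >min_len bp that overlap a SINE/LINE/LTR annotation by >min_frac."""
    hits = []
    for v_start, v_end in variants:
        if v_end - v_start <= min_len:
            continue  # keep only structural variants longer than 50 bp
        for te_start, te_end, te_class in annotations:
            is_te = te_class.split("/")[0] in TE_CLASSES
            if is_te and overlap_fraction(v_start, v_end, te_start, te_end) > min_frac:
                hits.append((v_start, v_end, te_class))
                break
    return hits


variants = [(100, 400), (1000, 1030), (2000, 2300)]
annotations = [(90, 390, "SINE/Alu"), (2000, 2100, "LINE/L1"), (1000, 1030, "LTR/ERV1")]
print(te_variants(variants, annotations))  # [(100, 400, 'SINE/Alu')]
```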

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


Selfish conflict underlies RNA-mediated parent-of-origin effects

Maintenance of worm strains

Nematodes were grown on modified nematode growth medium (NGM) plates with 1% agar/0.7% agarose to prevent C. tropicalis burrowing. Experiments were conducted at either 25 °C (C. tropicalis) or 20 °C (C. elegans). csr-1(+/−) strains were cultured on 6-cm NGM plates supplemented with 500 μl of G418 (25 mg ml⁻¹) for selecting heterozygous null individuals. Supplementary Table 2 lists all study strains, some of which were provided by the Caenorhabditis Genetics Centre, funded by the NIH Office of Research Infrastructure Programs (P40 OD010440).

Phenotyping and genotyping of crosses

For crosses, 4–5 L4 hermaphrodites were mated with 30–40 males in a 12-well plate with modified NGM. After 2 days, 10 L4 F₁ progeny were transferred to separate plates and genotyped by PCR, and at least 10 embryos per F₁ hermaphrodite were singled onto 6-cm NGM plates. Each F₂ individual was inspected visually every day for up to 7 days and scored for developmental stage and any phenotypic abnormalities. Embryonic lethality, arrested development, and delayed reproduction were assessed. Sterility was noted for adults not producing progeny. After 7 days, worms were lysed and genotyped. A list of primers used for genotyping can be found in Supplementary Table 3. Crosses involving csr-1(−); slow-1/grow-1 hermaphrodites versus EG6180 males or injected hermaphrodites versus NIL males were selected based on a pmyo-2::mScarlet reporter.

Generation of C. tropicalis transgenic lines

For CRISPR–Cas gene editing, we adapted previous protocols⁵³. In brief, 250 ng µl⁻¹ Cas9 or Cas12a proteins were incubated with 200 ng µl⁻¹ CRISPR RNA (crRNA) and 333 ng µl⁻¹ trans-activating crRNA (tracrRNA) before adding 2.5 ng µl⁻¹ co-injection marker plasmid (pCFJ90-mScarlet-I). For HDR, donor oligos (IDT) or biotinylated and melted PCR products were added at a final concentration of 200 ng µl⁻¹ or 100 ng µl⁻¹, respectively. Following injections into young hermaphrodites, mScarlet-positive F₁ animals were singled, and their offspring were screened by PCR and Sanger sequencing to detect successful editing. To clone the mScarlet::SLOW-1 donor, we added ~300-bp homology arms amplified from QX2345 genomic DNA to mScarlet-I (from pMS050) in pBluescript via Gibson assembly. Because csr-1 is essential for viability in C. elegans, we first devised a strategy to stably propagate a csr-1 heterozygous line in the absence of classical genetic balancers. To do so, we used CRISPR–Cas9 to introduce a premature stop mutation in the endogenous csr-1 locus followed by a neoR cassette, which confers resistance to the G418 antibiotic (Extended Data Fig. 6d). For the csr-1::neoR donor, we first replaced the C. elegans rps-27 promoter and unc-54 3′ UTR in pCFJ910 with 500 bp upstream and 250 bp downstream of the C. tropicalis rps-20 gene. This rps-20::neoR cassette was then flanked with ~550-bp homology arms amplified from EG6180 worms and inserted into pBluescript. Correct targeting introduces a stop codon after residue L337 of CSR-1 followed by a ubiquitously expressed neomycin resistance cassette. We propagated the mutant line on plates containing G418, thereby actively selecting for heterozygous csr-1(−) null individuals. Upon drug removal, most homozygous csr-1(−) individuals derived from heterozygous mothers developed into adulthood but were either sterile or laid mostly dead embryos. 
However, a small fraction of null mutants was partially fertile, and homozygous csr-1(−) lines could be stably propagated for multiple generations despite extensive embryonic lethality in the population (Extended Data Fig. 6d). All gRNAs and HDR templates are listed in Supplementary Tables 4 and 5.

In vitro RNA transcription and injection

The slow-1 cDNA was cloned into pGEM-T Easy (Promega, A1360), with a 5′ T7 RNA polymerase site and the start codon mutated (ATG>TTG) for RNA-only transcription. The plasmid was digested with NotI (NEB, R0189) to release the insert, which was subsequently purified by gel extraction and used as the template for RNA synthesis. RNA was prepared using the HiScribe T7 Quick High Yield kit (NEB, E2050) with the following modifications: addition of 3 µl of 10 mM DTT and 1 µl of RNaseOUT (Thermo, 10777019). After overnight transcription, the reaction was diluted, treated with RNase-free DNase I (NEB, M0303S), bead-purified (Vienna Biocenter MBS 5001111, High Performance RNA Bead Isolation), quantified (Thermo, Q32852), and stored at −80 °C. Injections were repeated twice using independently transcribed RNA at concentrations of 150 nM and 400 nM, yielding identical results.

Reciprocal crosses with the mScarlet::slow-1 reporter line

To assess SLOW-1 expression in F₁ progeny from reciprocal crosses between mScarlet::SLOW-1 NIL and EG6180 strains, we conducted two sets of crosses: (1) SLOW-1::mScarlet dpy (INK461) hermaphrodites to EG6180 males for maternal inheritance; and (2) EG6180 dpy (QX2355) hermaphrodites to mScarlet::SLOW-1 NIL males (INK459) for paternal inheritance. Wild-type young adult F₁ progeny were immobilized in NemaGel on a glass slide and imaged using an Axio Imager.Z2 (Carl Zeiss) widefield microscope with a Hamamatsu Orca Flash 4 camera (545/30 nm excitation filter). The analysis was performed in FIJI by tracing the germline in the DIC channel and measuring mean fluorescence, including gut autofluorescence.

Sequencing and genome assembly of EG6180

We extracted high molecular weight genomic DNA using the Masterpure Complete DNA and RNA purification kit (tissue sample protocol, Lucigen). We prepared 8 kb, 20 kb and unfragmented sequencing libraries using the 1D Ligation Sequencing Kit (Oxford Nanopore SQK-LSK109). The 8 kb fragmentation was done using g-TUBE (Covaris). Each library was loaded onto a MinION MK1B device (Oxford Nanopore). Read calling was done using MinKNOW software. We performed a hybrid assembly, incorporating Illumina sequencing reads of EG6180 with some modifications as detailed below⁹. We used assembled Illumina reads to correct raw Nanopore reads, which were assembled using Flye Assembler⁵⁴. The preliminary assembly included 119 contigs in 107 scaffolds (scaffold N50 was 1,489,504 bp). We derived synteny blocks between the provisional assembly and our chromosome-level NIC203 assembly using Sibelia⁵⁵ and used the synteny blocks to scaffold the contigs to chromosome level using Ragout⁵⁶.
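For readers unfamiliar with the N50 statistic quoted above, the sketch below computes it from a list of scaffold lengths using the standard definition: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. The input lengths are made up for illustration and are not the EG6180 scaffolds.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```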

Identification of C. tropicalis Argonaute proteins and piRNA pathway effectors

We annotated functional domains in C. tropicalis NIC203 using Interproscan 5 as part of our previous NIC203 genome assembly⁹. We identified Argonaute proteins with PFAM domains, including Piwi (PF02171), PAZ (PF02170), N-terminal domain of Argonaute (PF16486), Argonaute linker 1 (PF08699), Mid domain of Argonaute (PF16487) and Argonaute linker 2 (PF16488) domains. We excluded a protein with low molecular weight (41 kDa) as unlikely to be an Argonaute, as well as the orthologue of C. elegans Dicer, which represented an outgroup to the rest of the proteins. After aligning those sequences to C. elegans Argonautes identified in a previous study⁵⁷ using Clustal Omega, we conducted phylogenetic analysis using iqtree2 (ref. ⁵⁸), with 1,000 replicates of the approximate likelihood-ratio test (--alrt 1000) and 1,000 bootstraps (-b 1000). iqtree2 carries out an initial model selection step, and a substitution model with the general Q matrix, empirical codon frequencies, a proportion of invariable sites and a free rate heterogeneity (Q.pfam+F+I+R4) was selected. Additional orthologues of C. elegans piRNA effector genes were identified through reciprocal blastp searches, synteny conservation, and gene trees from Wormbase Parasite⁵⁹. C. elegans mut-16, rrf-1, and simr-1 have 1:1 orthologues in C. tropicalis. The evolutionary history of SET proteins is complex due to their propensity to gain and lose paralogues within Caenorhabditis. The gene annotated as C. tropicalis set-25 is the closest among six paralogues in its genome. Thus, the absence of a phenotype in the mutant may be attributed to genetic redundancy. The gene annotated as C. tropicalis set-32 is a close orthologue of two C. elegans genes: set-21 and set-32. The SET domains of C.tr-SET-32 and C.el-SET-32 are ~48% identical at the protein level. Additionally, using Alphafold2 (ref. ⁶⁰), we found that these two proteins have high structural similarity (root mean square deviation = 0.962), and a search using the predicted structure of C.tr-SET-32 as a query retrieved C.el-SET-32 as the top hit in C. elegans (Foldseek)⁶¹.

Transgenerational silencing of slow-1/grow-1

In the transgenerational inheritance experiments, EG6180 hermaphrodites were crossed to NIL (QX2345) males. F₁ individuals were genotyped after laying embryos to distinguish self-progeny from cross-progeny. F₂ embryos from cross-progeny mothers were singled, allowed to lay eggs and genotyped. F₃ homozygous carriers for slow-1/grow-1 were propagated for multiple generations and mated to EG6180 males. The slow-1/grow-1 TA activity was assessed by determining the proportion of delayed EG/EG non-carriers.

Single molecule in situ hybridization

Stellaris FISH Probes targeting slow-1, slow-2 and pgl-1 were designed using the Stellaris RNA FISH Probe Designer (Biosearch Technologies). The probes were labelled with Quasar 570, CAL Fluor Red 610 or Quasar 670, respectively (Biosearch Technologies). The protocol was adapted from Raj et al.⁶² and described in ref. ⁹. For imaging, an Axio Imager.Z2 (Carl Zeiss) widefield microscope with a Hamamatsu Orca Flash 4 camera and a 63×/1.4 plan-apochromat Oil DIC objective was used. Filters used were: DAPI excitation 406/15 nm, emission 457/50 nm and Quasar 570 excitation 545/30 nm, emission 610/75 nm. z-stack images with 40 slices (step size 0.2 µm) were acquired. Image analysis was performed with the FIJI plugin RS-FISH⁶³ with parameters set at Sigma 1.44, and threshold 0.0062.

RNA extraction and RNA-seq

Total RNA was extracted from approximately 100 young adult hermaphrodites and F₁ progeny; for the latter, recessive mutations were used to visually discriminate cross-progeny from self-progeny. Reciprocal crosses were set up between parental strains for maternal or paternal inheritance of slow-1/grow-1 by mating INK531 hermaphrodites (uncoordinated worms in NIC203 background) to EG6180 males and QX2355 hermaphrodites (dumpy worms in EG6180 background) to NIC203 males and selecting phenotypically wild-type progeny for RNA extraction. Reciprocal crosses between NIL and EG6180 strains were performed analogously (INK255 hermaphrodites (dumpy worms in NIL background) to EG6180 males and QX2355 hermaphrodites (dumpy worms in EG6180 background) to QX2345 NIL males). Total RNA was extracted following a modified version of the protocol in ref. ⁶⁴, including multiple M9 washes, TRIzol and chloroform incubation, phase separation, isopropanol precipitation and resuspension in RNase-free water. Samples with RNA integrity number (RIN) > 8 were used for library preparation using the NEBNext Poly(A) kit and sequenced on NextSeq2000 P2 SR100 or NovaSeq S1 PE100 at the Vienna Biocenter NGS facility. To reduce reference bias, raw reads were aligned to a concatenated NIC203 + EG6180 genome/transcriptome assembly using STAR and bcbio-nextgen (https://github.com/bcbio/bcbio-nextgen). Transcript quantification and normalization were performed with tximport and DESeq2 (ref. ⁶⁵). We used DESeq2 to fit a model for the normalized counts using the strain identity of the mother and sequencing batch (NextSeq versus NovaSeq libraries) as fixed effects and compared the model to a null model that included only batch using a likelihood-ratio test. Although we identified an outlier in the slow-1/grow-1 paternal inheritance samples (Fig. 1d), we found no obvious difference between the outlier and the other samples in terms of RNA quality or mRNA-seq quality control. 
However, since each library was derived from an independent genetic cross, we cannot rule out human error, and we therefore decided that it would be best practice to keep the outlier in the final analysis.

RT–qPCR

RNA was extracted from adult worms (50 males or 100 hermaphrodites per biological replicate) using TRIzol-chloroform extraction, followed by DNase I digestion⁶⁶, and RNA concentrations were then measured using the Qubit High-Sensitivity RNA fluorescence kit (Thermo). cDNA was prepared with SuperScript III reverse transcriptase (Thermo) using random hexamers. Intron-spanning primers were validated with standard curves from QX2345 cDNA to ensure amplification efficiency and an r² value above 0.95. The following primers were used: FW-slow-1-mRNA: 5′-GAGCTACCGGAACTGGATAAAG-3′, RV-slow-1-mRNA: 5′-CAGAGTTCTCGGAAGTCTCCTC-3′, FW-slow-1-pre-mRNA: 5′-CGGACTGGATGAAACATTTAGC-3′, RV-slow-1-pre-mRNA: 5′-GAGCGGTGTTGACctgaatc-3′, FW-cdc-42: 5′-CGATTAAATGTGTCGTCGTAGG-3′, and RV-cdc-42: 5′-ACCGATCGTAATCTTCTTGTCC-3′. All samples had at least 3 biological replicates. We used the ∆∆C_t method to calculate relative fold change and chose cdc-42 as a housekeeping gene⁶⁷,⁶⁸. cdc-42 expression showed a low coefficient of variation in our RNA-seq datasets, supporting its validity as a housekeeping gene. All RT–qPCR reactions were prepared with the Luna Universal qPCR and RT–qPCR kit (NEB) and run with an annealing temperature of 58 °C. All biological replicates were run in technical quadruplicate, and any reactions with abnormal amplification curves or melting temperatures were omitted before analysis (distinct from reactions for which we observed no amplification, which were not omitted). Representative samples from each condition were Sanger sequenced. We confirmed the absence of genomic DNA contamination in RNA samples by performing PCRs with gDNA-specific primers using the RNA as template and observed no amplification after 40 cycles. RT–qPCR indicated specific amplification of slow-1 in both hermaphrodites and males. However, the higher C_t values for males (34.27 versus 28.31 on average) and greater variability (s.d. of 1.55 versus 0.65 in the NIL) suggest much lower expression levels in males. This variability hinders a reliable estimate of abundance and assessment of the parent-of-origin effect in males.
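The ∆∆C_t calculation mentioned above can be sketched as follows. This is the textbook form of the method, which assumes roughly 100% amplification efficiency for both primer pairs (one doubling per cycle); the function name and the C_t values in the example are hypothetical and are not measurements from this study.

```python
def fold_change(ct_target_sample, ct_ref_sample, ct_target_control, ct_ref_control):
    """Relative expression by the delta-delta-Ct method (assumes 100% efficiency)."""
    # Normalize the target gene to the housekeeping gene within each condition.
    delta_ct_sample = ct_target_sample - ct_ref_sample
    delta_ct_control = ct_target_control - ct_ref_control
    # Compare the normalized values between conditions.
    delta_delta_ct = delta_ct_sample - delta_ct_control
    return 2 ** (-delta_delta_ct)


# Target amplifies 2 cycles later (relative to the reference) in the sample
# than in the control, so expression is one quarter of the control level.
print(fold_change(25.0, 20.0, 23.0, 20.0))  # 0.25
```

Here the reference C_t values would correspond to cdc-42 and the target values to slow-1.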

Small RNA library preparation and sequencing

We isolated sRNAs using the TraPR protocol⁶⁹. In brief, frozen worm pellets (2,000 worms per parental line) were supplemented with 350 µl lysis buffer (20 mM HEPES-KOH, pH 7.9, 10% (v/v) glycerol, 1.5 mM MgCl₂, 0.2 mM EDTA, 1 mM DTT, 0.1% (v/v) Triton X-100). Samples were mechanically disintegrated and subjected to 4 freeze–thaw cycles in liquid nitrogen. The resulting lysates were cleared by centrifugation and the sRNA fraction was isolated using the TraPR Small RNA Isolation Kit (135.24, LEXOGEN). Isolated sRNA was treated with RppH (M0356S, BioLabs) to ensure 5′ monophosphate-independent capturing of small RNAs⁷⁰, followed by purification with Agencourt RNA Clean XP magnetic beads (BECKMAN COULTER). The sRNA was ligated to a 32-nt 3′ adapter with unique barcodes (sRBC, Supplementary Table 6, IDT) using truncated T4 RNA ligase 2 (M0373L, NEB). The resulting RNA was run on 12% SequaGel–UreaGel (National Diagnostics) and purified with ZR small-RNA PAGE Recovery Kit (R1070, ZYMO RESEARCH). The 37-nt-long 5′ adapter was ligated to the sRNAs using T4 RNA ligase (M0204S, NEB). The resulting RNA was cleaned up (R1015, ZYMO RESEARCH), reverse-transcribed, and PCR amplified. The cDNA fragments (160–190 nt) were extracted and gel purified (D4008, ZYMO RESEARCH). Small RNA libraries were sequenced in triplicate on a NovaSeq in S1 SR100 mode (Illumina) at the Vienna Biocenter NGS facility. All sequencing libraries generated for this project are listed in Supplementary Table 7.

sRNA immunoprecipitation

To study piRNA binding preferences of PRG-1.1 and PRG-1.2, we performed sRNA immunoprecipitation of N-terminally Flag-tagged PRG-1.1 (INK775) and PRG-1.2 (INK735) followed by sRNA-seq. For each of the 3 biological replicates (50,000 worms each), 18 worm plates (9 cm) were bleached to synchronize the population. Young adults were collected, frozen at −70 °C, thawed and washed with RIP buffer (50 mM HEPES pH 7.2, 150 mM NaCl, 0.01% NP-40). For lysis, RIP buffer and Benzonase were added, and the samples were sonicated in a Diagenode Bioruptor and then cleared by centrifugation. For immunoprecipitation, 200 µl of Anti-Flag M2 Magnetic Beads (Millipore) were used (4 °C, overnight). The bound proteins were eluted in 500 µl 0.1 M glycine-HCl pH 2.7 for 5 min at room temperature and transferred into a vial containing 50 µl 1 M Tris-HCl pH 8. The proteins were digested with Proteinase K (0.7 mg ml⁻¹), and denatured proteins were removed by centrifugation following Proteinase K inactivation. Samples were stored at −70 °C until library preparation.

Small RNA analysis

Sequencing adapters were trimmed from 5′ and 3′ ends using Cutadapt v1.18 (ref. ⁷¹). Extracted 21U and 22G reads were aligned to the genome using hisat2 v2.1 (ref. ⁷²). For 22G, only reads mapped to the coding sequences were analysed; for 21U, reads mapped to coding sequences, tRNAs and rRNAs were excluded using seqkit v0.13 and samtools v1.10. 22G reads were quantified using featureCounts (Rsubread, R), normalized by the total number of 22G reads per replicate, and visualized using the Gviz R package⁶². Candidate 21U-RNAs were identified based on perfect mapping and abundance criteria (>0.1 ppm). A custom script quantified 21U-RNAs, and reads were normalized to miRNAs predicted based on homology to C. elegans miRNAs. To identify candidate 21U-RNAs that could target slow-1, we used known targeting rules in C. elegans and binding energies. First, putative binding sites and energies for all 21U-RNAs against slow-1 mRNA were predicted with RNAduplex (ViennaRNA Package v2.0.58)⁶³, and the five best duplexes for every piRNA were retained. Candidate piRNAs without bubbles during binding and with no more than 4 mismatches outside the seed region were extracted and ranked by binding energy (Supplementary Data 1). The second candidate list was generated considering the overall level of binding continuity by using Nucleotide blast v2.2.26 in blastn-short mode. Only 21U-RNAs with no mismatches or gaps in the seed region were selected for further analysis. Finally, we ranked 21U-RNAs by the total length of the ungapped alignment to slow-1 (Supplementary Data 1).
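The abundance criterion above (>0.1 ppm) amounts to a simple parts-per-million threshold. The sketch below illustrates it under the assumption that ppm is computed relative to the total number of reads in the library; the text does not state the denominator, and the function name, sequence labels and read counts are hypothetical.

```python
def abundant_candidates(counts, total_reads, min_ppm=0.1):
    """Return candidates whose abundance in parts per million exceeds min_ppm."""
    kept = {}
    for name, count in counts.items():
        ppm = count * 1_000_000 / total_reads
        if ppm > min_ppm:
            kept[name] = ppm
    return kept


counts = {"21U_a": 5, "21U_b": 1, "21U_c": 3}
print(abundant_candidates(counts, total_reads=20_000_000))
```

With 20 million reads, a candidate needs more than two reads to pass the 0.1 ppm threshold, so 21U_a and 21U_c are kept and 21U_b is dropped.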

Chromatin immunoprecipitation

For chromatin immunoprecipitation, we collected an F₄ population of homozygous carriers for the repressed slow-1 allele after paternal inheritance, which was highly enriched in 22G-RNAs complementary to slow-1 (Fig. 3i,j). First, we crossed EG6180 hermaphrodites to NIL males. The F₂ were genotyped to identify repressed slow-1/grow-1 (NIC/NIC) worms, which were expanded for two generations (F₄) and collected as young adults. Each ChIP sample represents an independent genetic cross. Worms (200 µl) were collected, washed and incubated to minimize bacterial content and frozen in liquid nitrogen. For ChIP, we used the protocol described⁶⁴. In brief, the frozen worm pellet was pulverized by grinding in a mortar with liquid nitrogen, and the powder was crosslinked in 1 ml ice-cold RIPA buffer supplemented with 2% formaldehyde (10 min, 4 °C). After quenching by addition of 100 µl 1 M Tris-HCl (pH 7.5), the sample was sonicated using Covaris for 600 s to achieve chromatin fragments of 200–500 bp. Fifty microlitres of the lysate was saved as an input fraction. Chromatin was immunoprecipitated using anti-H3K9me3 antibody (Ab8898, Abcam). The immunoprecipitation product was incubated with Protein A Dynabeads (Thermofisher scientific) and washed with LiCl. The immunoprecipitation product was eluted from beads and DNA was purified using ChIP DNA Clean and Concentrator kit (Zymo Research). Input control fractions were treated similarly to immunoprecipitation samples. DNA libraries were prepared with NEBNext Ultra II DNA Library Prep Kit (Illumina). Reads were deduplicated using bbmap v38.26, aligned using bwa mem v0.7.17 (ref. ⁶⁵), and normalized by the number of reads that mapped to the genome with samtools v1.10 (ref. ⁷³). Peaks were called by macs2 v2.2.5 with the --broad and --mfold 1 50 options⁷⁴. Quality control plots were made using deeptools v3.3.1 (ref. ⁷⁵). H3K9me3 signal was calculated as read counts per genomic position in the ChIP sample normalized by counts in the corresponding input sample using bedtools v2.27 (ref. ⁷⁶) and a custom R (v4.3) script.
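The input normalization described above can be sketched as follows. This is only an illustration of the arithmetic, not the authors' bedtools and R pipeline. The function name, the per-position lists and the choice to return `None` where the input has no coverage are all assumptions made for the example.

```python
def input_normalized_signal(chip_counts, input_counts, chip_mapped, input_mapped):
    """Per-position ChIP/input ratio after scaling each sample by its mapped reads.

    chip_counts, input_counts: read counts at the same genomic positions.
    chip_mapped, input_mapped: total reads mapped to the genome in each sample.
    Positions with zero input coverage yield None (an assumption of this sketch).
    """
    if len(chip_counts) != len(input_counts):
        raise ValueError("ChIP and input must cover the same positions")
    if chip_mapped <= 0 or input_mapped <= 0:
        raise ValueError("mapped read totals must be positive")

    signal = []
    for chip, inp in zip(chip_counts, input_counts):
        if inp == 0:
            signal.append(None)  # ratio undefined without input coverage
            continue
        chip_scaled = chip / chip_mapped    # depth-normalized ChIP coverage
        input_scaled = inp / input_mapped   # depth-normalized input coverage
        signal.append(chip_scaled / input_scaled)
    return signal


# Toy numbers only: three positions, 100 mapped ChIP reads, 50 mapped input reads.
h3k9me3 = input_normalized_signal([10, 0, 4], [5, 2, 0], 100, 50)
```

In practice a pseudocount or a minimum-coverage filter is often used instead of returning `None`. The source does not say which approach was taken.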

Immunohistochemistry

Gravid nematodes were washed from plates, and embryos were extracted using bleach solution. The embryo suspension was applied to prepared poly-l-lysine slides (Sigma-Aldrich, P8920), immersed in liquid nitrogen, fixed in ice-cold methanol (10 min) followed by acetone (10 min), and rehydrated in descending ethanol concentrations (95%, 70%, 50% and 30% ethanol). Fixed embryos were blocked in 3% BSA (VWR Life Science, 422351 S), followed by incubation with anti-Flag M2 primary antibody (Sigma-Aldrich, F3165, diluted 1:3,000). After washing, the secondary antibody Alexa Fluor A568 (ThermoFisher Scientific, A-11031, diluted 1:3,000) was applied, followed by additional washes. The final wash contained DAPI (Merck, D9542, 5 ng ml⁻¹). Processed embryos were mounted with Fluoroshield (Sigma-Aldrich, F6182) and imaged on an Axio Imager 2 (ZEISS).

Fluorescence intensity quantification

Twenty-four-bit raw images were analysed in Fiji (v1.53r)⁷⁷. Embryos were selected with the freehand tool, and the same selection mask was used to capture background fluorescence intensity for each embryo. To compare fluorescence intensities between strains, we used the corrected total cell fluorescence (CTCF) parameter (CTCF = integrated density − (area of selected cell × mean fluorescence of background readings)). At least 23 embryos were used for quantification.
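As a worked example of the CTCF formula above, the following Python sketch computes the value from three measurements of the kind Fiji reports for a selection. The function name and the numbers are illustrative assumptions, not values from the study.

```python
def ctcf(integrated_density, area, mean_background):
    """Corrected total cell fluorescence.

    CTCF = integrated density - (area of selected cell x mean background fluorescence)
    """
    return integrated_density - area * mean_background


# Made-up measurements: integrated density 5000, area 200 pixels,
# mean background intensity 10 per pixel.
example_ctcf = ctcf(5000.0, 200.0, 10.0)  # 5000 - 200 * 10 = 3000
```

Because the same selection mask is used for the embryo and its background reading, the area term is shared between the two measurements.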

Worm protein lysate preparation and western blot

Gravid adult worms were collected, washed, and flash-frozen in liquid nitrogen. Worm pellets were resuspended in ice-cold lysis buffer (30 mM HEPES pH 7.4, 100 mM KCl, 2 mM MgCl₂, 0.05% IGEPAL, 10% glycerol and 1 tablet of protease inhibitors (Roche, 11836153001)) and lysed by sonication in a Bioruptor (UCD-200, Diagenode) followed by centrifugation to obtain the supernatant. After protein quantification by Bradford assay (Thermo Scientific, 23238), samples were diluted, resuspended in SDS loading buffer, and loaded onto NuPAGE gels (Invitrogen). Samples were transferred to 0.45 µm PVDF membrane (Thermo Scientific, 88518) and blocked with 4% non-fat milk in TBS-T. Membranes were incubated with anti-Flag M2 (mouse, 1:2,000, Sigma-Aldrich, F3165) or anti-actin (rabbit, 1:3,000, Abcam, ab13772) primary antibody overnight, followed by incubation with HRP-conjugated anti-mouse (1:10,000, Invitrogen, G-21040) or anti-rabbit (1:10,000, Jackson Immuno, 111-035-045) secondary antibody. Detection was performed using ECL reagent (Cytiva, RPN2106) and imaged with ChemiDoc MP (Bio-Rad). Membranes were stripped before reprobing (Thermo Scientific, 21059).

Live imaging of mScarlet::SLOW-1

Approximately 20 gravid adults were dissected in M9 medium under a stereo microscope. Embryos were transferred to individual wells in a Thermo Scientific Nunc MicroWell 384-Well Optical-Bottom Plate (Thermo Scientific). Embryos were imaged using an Olympus spinning disk confocal based on an Olympus IX3 Series (IX83) inverted microscope, equipped with a dual-camera Yokogawa W1 spinning disk (Yokogawa Electric Corporation) and two ORCA-Flash 4.0 V3 Digital CMOS cameras (Hamamatsu). Each field was imaged using a 40×/0.75 NA (air) objective with 16 z-sections at 2 µm, under the following conditions: bright-field (100% power, 30 ms) and 568 nm (100% power, 500 ms). Image acquisition was performed using cellSens software (Olympus). Images were processed and montages were created using Fiji and embryoCropUI⁷⁸.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Tag: Molecular evolution

Synthetic peptides

Expression and purification of proteins

AlphaFold structure modelling

Far-UV CD spectropolarimetry

NMR spectroscopy

2D NMR lineshape analysis

CEST NMR

ZZ-exchange

Secondary chemical shifts

Transverse relaxation

Isothermal titration calorimetry

Fluorophore labelling for smFRET

Single-molecule FRET measurements and analysis

Reporting summary

Plant growth

Construct design and cloning

Rice transformation

Transactivation assay

GUS staining

Confocal microscopy

SEM

Enrichment of bundle-sheath nuclei using fluorescence-activated cell sorting

Chlorophyll quantification

Nuclei extraction and single-nucleus RNA-seq (10X RNA-seq)

Nuclei extraction and single-nucleus RNA-seq (sci-RNA-seq3)

Nuclei extraction and single-nucleus RNA-seq (10X Multiome)

Nuclei clustering

Orthology analyses

Differential expression and accessibility responses to light

GO analyses

Cis-element analyses

Reporting summary

Amylase gene naming conventions

Datasets

Determination of subsistence by population

Read-depth-based copy number genotyping

Analysis of gene expression

MAP-graph construction

Analysis of mutations at amylase genes

PGGB-based graph construction

Haplotype deconvolution approach

Linkage disequilibrium estimation

Coalescent tree, ancestral-state reconstruction and PCA

Signatures of recent positive selection in modern human populations

Inference of recent positive selection in West Eurasian populations using ancient genomes

Reporting summary

The P. breviceps sugar glider breeding colony

Scanning electron microscopy analysis

Sample acquisition, genome sequencing and genome assemblies

ATAC–seq analysis

ChIP–seq analysis

Computational analyses

Identifying GARs

Gene enrichment analysis

Conservation between petauroid marsupials

Conservation between marsupials, laboratory mice and humans

Gene Ontology and KEGG pathway analysis

Transcription-factor-binding analysis

scRNA-seq analysis of laboratory mouse data

Micro-C analysis

Generation of immortalized sugar glider fibroblasts

Luciferase assays

GAR analysis

Wnt5a–Emx2 interaction analysis

Immunostaining

Statistics and reproducibility

In vitro shRNA experiments

In-pouch lentiviral transgenesis

Phenotypic measurements

Tissue collection for qPCR and RNA-seq analysis

qPCR analysis

Bulk RNA-seq

Tissues

Cells

Whole-mount in situ hybridization analysis

HCR in situ hybridization

Stereoseq analysis

RNAscope

Mouse transgenics

Cultivation of S. elongatus and sample preparation for metabolomics analysis

Quantification of intracellular metabolites from S. elongatus by LC–MS/MS

Calculation of R_g values