Tag: Structural biology

  • Tracking transcription–translation coupling in real time

    Strains and plasmids

    Ribosome mutant SQ380 strains (B68 strain, 30S helical extension in h33a; C68 strain, 30S helical extension in h44; and ZS22 strain, 50S helical extension in h101) and all plasmids for initiation and elongation factors were a gift from the Puglisi laboratory16,17. The plasmid for overexpressing initiator tRNAfmet (pBStRNAfmetY2) was a gift from E. Schmitt56. The plasmids for E. coli RNAP (pVS10) and σ70 (pIA586) were purchased from Addgene (#104398 and #104399, respectively)57. The ybbR peptide tag (DSLEFIASKLA)58,59 was cloned into the pVS10 plasmid by introducing the ybbR DNA sequence with primers, followed by 5′-end phosphorylation with T4 PNK (New England Biolabs, NEB) and simultaneous blunt ligation of the plasmid using T4 DNA ligase (NEB). For the N-terminal ybbR mutant, the tag was inserted between E15 and E16 in the β′ subunit; for the C-terminal β′ mutant, it was inserted between E1377 and G1402, thereby deleting the region A1378–L1401. Gene sequences of ribosomal protein S1, NusA, NusG, NusG(NTD) (M1–P124), NusG(F165A), methionyl-tRNAfmet formyltransferase (MTF) and methionine tRNA synthetase (MetRS, truncated at K548 to obtain the monomeric form of the enzyme60) were cloned into a pESUMO vector backbone using Gibson assembly. Gene sequences were obtained from ASKA collection plasmids61.

    Sample preparation

    E. coli ribosomal subunits, initiation factor IF2 and elongation factors (EF-Tu, EF-G and EF-Ts) were prepared and purified as previously described16,17. Wild-type RNAP and mutant versions were purified as described57. Initiator tRNAfmet was prepared and purified following the protocol described62. S1, NusA, NusG and NusG mutants were overexpressed as N-terminal fusions with His6–SUMO. They were purified by lysing the cells in IMAC buffer (50 mM Tris-HCl pH 8.0, 300 mM NaCl, 10–20 mM imidazole, 10 mM 2-mercaptoethanol) using a microfluidizer, loading the clarified lysate on 5 ml HisTrap HP columns (Cytiva) and eluting with increasing imidazole concentrations over 20 column volumes. Protein fractions were pooled and the fusion tag was cleaved overnight with His-tagged Ulp1 protease, while dialysing against IMAC buffer without imidazole. The cleaved tag was removed by a second HisTrap purification. Protein fractions were pooled, concentrated using 15 ml Amicon Ultracel 10 or 30 K concentrators and further purified via size exclusion chromatography using either HiLoad S75 (NusG/NusG(NTD)/NusG(F165A)) or HiLoad S200 (NusA, S1) columns (Cytiva) equilibrated with storage buffer (S1 in 50 mM HEPES-KOH pH 7.6, 100 mM KCl, 1 mM DTT and NusA, NusG or NusG mutants in 20 mM Tris-HCl pH 7.6, 100 mM NaCl, 0.5 mM EDTA/DTT). Protein fractions were pooled and concentrated using 15 ml Amicon Ultracel 10 or 30 K concentrators to ~100 µM. Aliquots were flash-frozen with liquid nitrogen and stored at −80 °C. In the case of MTF and MetRS, the preparation was done as described above with the following changes (adapted from refs. 62,63): (1) The MTF IMAC buffer contained 10 mM K2HPO4/KH2PO4 pH 7.3, 100 mM KCl, 20 mM imidazole and 10 mM 2-mercaptoethanol; the MetRS buffer contained 10 mM K2HPO4/KH2PO4 pH 6.7, 50 mM KCl, 20 mM imidazole, 100 µM ZnCl2 and 10 mM 2-mercaptoethanol.
(2) After a second HisTrap purification, MetRS containing fractions were pooled and dialysed against storage buffer (10 mM K2HPO4/KH2PO4 pH 6.7, 10 mM 2-mercaptoethanol and 50% (v/v) glycerol). Aliquots were flash-frozen with liquid nitrogen and stored at −80 °C. (3) After a second HisTrap purification, MTF was further purified via a 5 ml HiTrap Q FF column (Cytiva) and eluted using a linear gradient of 20–100% into Q-sepharose buffer (10 mM K2HPO4/KH2PO4 pH 7.3, 500 mM KCl and 10 mM 2-mercaptoethanol). Protein fractions were pooled and dialysed against storage buffer (10 mM K2HPO4/KH2PO4 pH 6.7, 100 mM KCl, 10 mM 2-mercaptoethanol and 50% (v/v) glycerol). Aliquots were flash-frozen with liquid nitrogen and stored at −80 °C. An SDS–PAGE gel of all protein factors used in this manuscript is included in Extended Data Fig. 1j. All uncropped gels are presented in Supplementary Fig. 1.

    Charging tRNAfmet and elongator tRNAs

    Typically, initiator tRNAfmet (20 µM) was simultaneously charged and formylated in 800 µl reactions using 100 µM methionine, 300 µM 10-formyltetrahydrofolate, 200 nM MetRS and 500 nM MTF in charging buffer (50 mM Tris-HCl pH 7.5, 150 mM KCl, 7 mM MgCl2, 0.1 mM EDTA, 2.5 mM ATP and 1 mM DTT), incubating for 5 min at 37 °C (refs. 62,64). The fmet–tRNAfmet was immediately purified by addition of 0.1 volumes of sodium acetate (pH 5.2), extraction with aqueous phenol (pH ~4) and precipitation with 3 volumes of ethanol. The tRNA pellet was solubilized in ice-cold tRNA storage buffer (10 mM sodium acetate pH 5.2, 0.5 mM MgCl2) and further purified with a Nap-5 column (Cytiva) equilibrated with the same buffer. The eluate was aliquoted, flash-frozen with liquid nitrogen and stored at −80 °C. Elongator tRNAs were purchased (tRNA MRE600, Roche) and charged typically at 500 µM concentration in the presence of 0.2 mM amino acid mix (each), 10 mM phosphoenolpyruvate (PEP), 20% (v/v) S150 extract (prepared following ref. 65), 0.05 mg ml−1 pyruvate kinase (Roche) and 0.2 U µl−1 thermostable inorganic pyrophosphatase (NEB) in total tRNA charging buffer (50 mM Tris-HCl pH 7.5, 50 mM KCl, 10 mM MgCl2, 2 mM ATP and 3 mM 2-mercaptoethanol)29. Typically, 200 µl reactions were incubated at 37 °C for 15 min and then immediately purified as described above. To remove NTP contamination introduced by the S150 extract, the aa–tRNAs were further purified over an S200 Increase column (Cytiva), pre-equilibrated in tRNA storage buffer. The eluted fractions were combined, concentrated with 2 ml Amicon Ultracel 3K concentrators, aliquoted, flash-frozen with liquid nitrogen and stored at −80 °C. Charging efficiency (typically >90%) was verified with acidic urea polyacrylamide gel electrophoresis as described64.

    Dye labelling of expressome components

    Hairpin loop extensions of mutant ribosomal subunits were labelled with prNQ087–Cy3 or prNQ159–Cy3B (30S) and prNQ088–Cy5 (50S) DNA oligonucleotides, complementary to the mutant helical extensions16,17,19 (see Supplementary Table 3 for all DNA and RNA oligonucleotide sequences). Just prior to the experiments, each subunit was labelled separately at 2 µM concentration using 1.2 equivalents of the respective DNA oligonucleotide by incubation at 37 °C for 10 min and then at 30 °C for 20 min in a Tris-based polymix buffer (50 mM Tris-acetate pH 7.5, 100 mM potassium chloride, 5 mM ammonium acetate, 0.5 mM calcium acetate, 5 mM magnesium acetate, 0.5 mM EDTA, 5 mM putrescine-HCl and 1 mM spermidine). RNAP–ybbR mutants (core enzyme) were labelled by mixing 7 µM RNAP, 14 µM SFP synthetase and 28 µM CoA-Cy5 dye in a buffer containing 50 mM HEPES-KOH pH 7.5, 50 mM NaCl, 10 mM MgCl2, 2 mM DTT and 10% (v/v) glycerol66. Typically, 100 µl reactions were incubated at 25 °C or 37 °C for 2 h and analysed on denaturing protein gels. The holoenzyme was formed by incubating 1.11 µM Cy5-labelled RNAP with 3 equivalents of σ70 for 30 min on ice in RNAP storage buffer (20 mM Tris-HCl pH 7.5, 100 mM NaCl, 0.1 mM EDTA, 1 mM DTT, 50% (v/v) glycerol). Aliquots were stored at −20 °C. SFP synthetase and free dye were removed on the imaging surface prior to experiments.

    DNA templates were purchased (TwistBioscience) and amplified via PCR using p0030 forward and p0075 reverse abasic primers, generating single-stranded 5′ overhangs for both DNA strands15. The fragments were purified on 2% agarose gels, extracted using a QIAGEN gel extraction kit and buffer-exchanged into e55 buffer (10 mM Tris-HCl pH 7.5 and 20 mM KCl) using 0.5 ml Amicon Ultracel 30 K concentrators. The 5′ overhang of the template DNA was hybridized by mixing with 1.2 equivalents of p0088–2×Cy3.5 DNA oligonucleotide at 68 °C for 5 min, followed by slow cool down (~1 h) to room temperature.

    Single-round in vitro transcription assays

    Stalled TECs were assembled in transcription buffer (50 mM Tris-HCl pH 8, 20 mM NaCl, 14 mM MgCl2, 0.04 mM EDTA, 40 µg ml−1 non-acylated BSA, 0.01% (v/v) Triton X-100 and 2 mM DTT) as described previously15,67. In brief, 50 nM DNA template was incubated (20 min, 37 °C) with four equivalents of RNAP in the presence of 100 µM ACU trinucleotide, 5 µM GTP and 5 µM ATP (+150–300 nM 32P α-ATP, Hartmann Analytic), halting the polymerase at U24 to prevent loading of multiple RNAPs on the same DNA template. Next, re-initiation of transcription was blocked by addition of 10 µg ml−1 rifampicin. The RNAP was walked to the desired stalling site by addition of 10 µM UTP and incubation at 37 °C for 20 min. The 30S ribosomal subunit was loaded for 10 min at 37 °C by incubating 25 nM stalled TEC with 250 nM 30S (B68 mutant, pre-incubated with stoichiometric amounts of S1 protein for 5 min at 37 °C) in the presence of 2 µM IF2, 1 µM fmet–tRNAfmet and 4 mM GTP in polymix buffer with 15 mM magnesium acetate. To enrich for fully assembled ribosome–RNAP complexes, the expressome was purified by immobilizing the 50S subunit (ZS22) onto streptavidin magnetic beads (NEB). For this, the 50S subunit was pre-annealed on h101 with the prNQ302-prNQ303-prNQ304-p0109–biotin DNA oligonucleotide. The prNQ303 and prNQ304 DNA oligonucleotides contain a BamHI cleavage site to elute the purified ribosome–RNAP complex from streptavidin beads. 50S loading onto the stalled TEC/30S PIC occurred simultaneously with immobilization of the ribosome–RNAP complex on streptavidin magnetic beads (pre-equilibrated with polymix buffer with 15 mM magnesium acetate) for 10 min at room temperature. Typically, 50 µl of beads were loaded with a total volume of 150 µl of ribosome–RNAP complex. The immobilized complex was washed once with polymix buffer with 15 mM magnesium acetate and then eluted with 100 µl polymix buffer with 15 mM magnesium acetate, while cleaving with BamHI for 20 min at 37 °C.
The eluate was chased with 50 µM NTPs in the presence or absence of Nus factors (at 1 µM each, when present), 4 mM GTP, polymix buffer with 15 mM magnesium acetate and 100 mM potassium glutamate. Time points were taken before NTP addition (t = 0 s) and at 10, 20, 30, 40, 60, 90, 120, 180, 240, 360 and 600 s for each condition, mixing 2.5 µl of sample with 5 µl of stop buffer (7 M urea, 2× TBE, 50 mM EDTA, 0.025% (w/v) bromophenol blue and xylene blue) and incubating at 95 °C for 2 min. Single-round transcription assays with 70S in trans were performed as described above with minor changes. Ribosomal subunits were loaded at 1 µM concentration on 8 µM 6(FK) mRNA17 to ensure that 70S PIC formation was complete. Stalled TEC was added, and the reaction was chased with 50 µM NTPs (each) in the presence of 1 µM NusA, giving final concentrations of 350 nM 70S PIC and 6.25 nM stalled TEC. As a size reference, the ssRNA ladder (NEB; sizes: 50, 80, 150, 300, 500 and 1,000 nt) was 5′-end labelled with 32P γ-ATP. The reactions were analysed on 6% denaturing PAGE (7 M urea in 1× TBE), run in 1× TBE at 50 W for 2–3 h. Gels were dried, exposed overnight on a phosphor screen (Cytiva, BAS IP MS 2040 E) and imaged using a Typhoon FLA 9500. Band intensities (P) were integrated using ImageLab 6.1 software (Bio-Rad) and divided by the total RNA per lane (T) to compensate for pipetting errors, as described68. Normalized band intensities (P/T) were plotted as a function of time. Pause-escape lifetimes were fitted as described68 by plotting ln(P/T) against time and fitting the pause-escape data range (indicated in the plots) with a linear equation (y = mx + b, with m being the pause-escape rate). All uncropped gels are presented in Supplementary Fig. 1.
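    The normalization and linear fit described above can be sketched in Python (a minimal illustration with hypothetical band intensities; the published analysis used ImageLab-integrated intensities and the fitting procedure of ref. 68):

```python
import numpy as np

def pause_escape_rate(times, band_intensity, total_rna):
    """Normalize band intensities (P/T) and fit ln(P/T) against time with a
    straight line; the slope m is the pause-escape rate."""
    p_over_t = np.asarray(band_intensity, float) / np.asarray(total_rna, float)
    m, b = np.polyfit(times, np.log(p_over_t), 1)
    return m, b

# hypothetical pause band decaying with an escape rate of 0.05 s^-1
t = np.array([10.0, 20.0, 30.0, 40.0, 60.0, 90.0])
band = 0.8 * np.exp(-0.05 * t)
m, b = pause_escape_rate(t, band, np.ones_like(t))
```

The pause-escape lifetime is then 1/|m| for the fitted data range.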

    Single-molecule transcription–translation coupling assays

    Stalled TECs were formed as described for single-round transcription assays with the following changes: (1) Typically, labelled DNA was used (pre-annealed with p0088–2×Cy3.5). (2) Stalled transcription was performed in the presence of 10 µM ATP and 10 µM GTP. (3) Re-initiation of transcription was blocked with 1 mg ml−1 heparin (final concentration). (4) For immobilization, the 5′ end of the nascent mRNA was labelled with a biotin–DNA oligonucleotide (prNQ127-p0109–biotin). (5) When necessary, Cy5-labelled RNAP was used (N-terminal ybbR tag or C-terminal ybbR tag). Typically, ribosomes (250 nM 30S, 500 nM 50S) were loaded for 5 min at 37 °C by incubating 4 nM stalled TEC in the presence of 2 µM IF2, 1 µM fmet–tRNAfmet and 4 mM GTP in polymix buffer with 15 mM magnesium acetate. The loading reaction was diluted to 100–800 pM stalled TEC concentration (DNA template-based) with an IF2-containing (2 µM) polymix buffer at 15 mM magnesium acetate and immobilized on biotin–PEG functionalized slides coated with NeutrAvidin for 10 min at room temperature69. Unbound components were washed away with an IF2-containing (2 µM) imaging buffer (polymix buffer with 15 mM magnesium acetate, 100 mM potassium glutamate) and imaging was started immediately. The imaging buffer was supplemented with an oxygen scavenger system (OSC) containing 2.5 mM protocatechuic acid and 190 nM protocatechuate dioxygenase, and a cocktail of triplet-state quenchers (1 mM 4-nitrobenzyl alcohol, 1 mM cyclooctatetraene and 1 mM Trolox) to minimize fluorescence instability. For equilibrium experiments, the imaging buffer additionally contained 4 mM GTP and 1 µM Nus factors (each; exact composition indicated in figures).
For real-time experiments, the reaction was initiated while imaging (after 10–30 s) by delivery of 50 nM 50S–Cy5 (where applicable), 2 µM IF2, 10–500 nM EF-G (where applicable, in specified concentrations, typically 50 nM), 100–1,500 nM ternary complex (where applicable, in specified concentrations, typically 150 nM), 10–1,000 µM NTPs (each, where applicable, in specified concentrations) and 1 µM Nus factors (each, where applicable) in the same imaging buffer containing an additional 4 mM GTP. The ternary complex was prepared as described previously70,71. For two-colour translation experiments, a second delivery mix was injected containing 10% of the initial OSC (in imaging buffer) to actively induce photobleaching after 20 min of movie acquisition.

    Ribosomal subunit fluctuations

    Spontaneous intersubunit rotations can complicate the analysis and assignment of real translation transitions22. We therefore chose experimental conditions that minimize interference from spontaneous intersubunit rotations. We used polymix buffer, which a recent study by Ermolenko and colleagues showed reduces the fraction of ribosomes exhibiting spontaneous fluctuations by 20-fold, with only 1% of h44–Cy3/H101–Cy5 ribosomes fluctuating spontaneously22. Furthermore, we chose slower translation conditions; because spontaneous intersubunit rotations occur on a faster timescale (0.3–2 s−1; see ref. 22), not all of them are detected at our frame time of 200 ms. The difference in timescales between real translation and intersubunit fluctuations is especially pronounced during ribosome slowdown, when the timescales of translation and of spontaneous intersubunit rotations can differ by two orders of magnitude.

    Single-molecule instrumentation and analysis

    We performed all single-molecule experiments at 21 °C using a custom-built, objective-based (CFI SR HP Apochromat TIRF 100×C Oil) total internal reflection fluorescence (TIRF) microscope (built by Cairn Research: https://www.cairn-research.co.uk/), equipped with an iLAS system (Cairn Research) and Prime95B sCMOS cameras (Teledyne Photometrics). For standard TIRF experiments (2 or 3 colours), we used a diode-based (OBIS) 532 nm laser at 0.6 kW cm−2 intensity (on the basis of output power). The fluorescence intensities of Cy3, Cy3.5 and Cy5 dyes were recorded at exposure times of 200 or 300 ms. For alternating-laser excitation32 (ALEX) experiments, we operated the 532 nm laser at 0.73 kW cm−2 intensity and, in every alternate frame, illuminated the samples with a diode-based 638 nm laser (Omicron LuxX) at 0.12 kW cm−2 intensity (200 ms exposure time for each laser, resulting in a frame rate of approximately 2 frames per second). Typically, 10 min movies were recorded. For determination of FRET efficiencies, longer movies (20 min) were acquired to ensure photobleaching of both dyes.

    Images were acquired using the MetaMorph software package (Molecular Devices) and single-molecule traces were extracted using the SPARTAN software package (v.3.7.0)72. Subsequent analysis was done using the tMAVEN73 software (when applicable) and with scripts19,74 written in MATLAB R2021a and previous versions (MathWorks). In brief, three-colour data were assigned by thresholding and two-colour data by HMM; both approaches are detailed below. Data for Fig. 2 were recorded both as three-colour experiments (presented in Fig. 2b,c) and as two-colour replicates (the data presented in Fig. 2d–f combine the two-colour and three-colour replicates; see Source Data).

    (1)

      For two-colour translation experiments (30S–Cy3 or 30S–Cy3B and 50S–Cy5; Fig. 2), molecules were selected in SPARTAN that had a single photobleaching step for both the donor and the acceptor dye. Those traces were baseline-corrected and corrected for donor emission bleedthrough and for the apparent sensitivity of each fluorophore. Subsequently, traces were exported to tMAVEN. The evaluation windows spanned from initial 50S subunit joining until (a) one of the dyes photobleached, (b) one dye entered a dark state or (c) fluorescence intensities became distorted, making further assignment difficult. Transitions were detected and assigned to two states using HMM (composite → vbHMM + Kmeans). For downstream evaluation (Extended Data Fig. 1c–f), those assignments were exported to MATLAB. First, HMM assignments were visually inspected and manually corrected for anti-correlated intensity changes of both dyes (real FRET transitions; see top trace in Extended Data Fig. 1d). Occasionally, intermediate FRET states were encountered; these were corrected to high-FRET states using thresholding. For example, for Cy3B, this manual correction reduced the final average number of transitions per trace from 14 to 12 under elongating conditions and from 5 to 4 under colliding conditions. Second, traces were corrected for spontaneous subunit fluctuations of the rotated state, encountered especially under colliding conditions (see Extended Data Fig. 1e). To distinguish rotated-state fluctuations (1–2 s timescale22) from real translations (non-rotated state median (3 replicates) = 12.8 ± 4.5 s when using 150 nM aa–tRNA), we used the dwell times of non-rotated states under elongating conditions as a threshold to cut off non-translations (see Extended Data Fig. 1c–e). For this, we determined the 5th percentile of all non-rotated-state dwells under elongating conditions.
Whenever two consecutive non-rotated dwells were encountered that were both shorter than the 5th-percentile threshold (2.18/0.936 s; the probability of two consecutive non-rotated dwells shorter than the 5th percentile occurring under normal translation conditions is 0.25%), the translation count was stopped by setting those two dwells and all subsequent non-rotated-state HMM intensities to rotated-state HMM intensities (see Extended Data Fig. 1e, top trace). Very short non-rotated dwells (<1st percentile; 0.936/0.5834 s) were then also removed. The last translation event in each trace was photobleaching-limited. We kept those events in the data evaluation, as they inform on ribosome stalling after collision with the RNAP or after encountering the stop codon.
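    The two-consecutive-short-dwell rule can be sketched as follows (a Python illustration of the logic only; the 2.18 s threshold is from the text, but the dwell series here is hypothetical and the published analysis was done in MATLAB on HMM-assigned traces):

```python
import numpy as np

def count_real_translations(nonrot_dwells, thresh):
    """Return the number of non-rotated dwells kept as real translation
    events. Counting stops at the first pair of consecutive dwells that are
    both shorter than the threshold (the 5th-percentile dwell under
    elongating conditions); those two and all later dwells are discarded."""
    d = np.asarray(nonrot_dwells, float)
    for i in range(len(d) - 1):
        if d[i] < thresh and d[i + 1] < thresh:
            return i  # keep dwells 0 .. i-1
    return len(d)

# hypothetical dwell series (s): three real translations, then fast
# rotated-state fluctuations after collision with the RNAP
dwells = [12.0, 9.5, 14.2, 1.1, 0.9, 1.3, 0.8]
kept = count_real_translations(dwells, 2.18)
# kept == 3
```

Under a 5th-percentile threshold, two consecutive sub-threshold dwells occur by chance with probability 0.05 × 0.05 = 0.25%, matching the figure quoted above.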

    (2)

      For three-colour translation experiments (30S–Cy3, DNA–Cy3.5 and 50S–Cy5; Fig. 2), HMM assignment was not possible owing to spectral bleedthrough between channels (Fig. 2b,c; see also Extended Data Fig. 7c,d). Therefore, FRET transitions were assigned using trace-specific thresholds, selecting only for productive FRET states (high-FRET to low-FRET state or vice versa; Extended Data Figs. 1d,e and 7c,d and supplementary figure 11 in ref. 15). For the box plots shown in Fig. 2e, the following numbers of molecules/dwells were used (see also Extended Data Fig. 2c). Numbers for the first 6 amino acids (see Source Data for the full list) are: elongating, non-rotated state: n = 315, 315, 313, 312, 306 and 296; elongating, rotated state: n = 315, 315, 312, 307, 298 and 283; colliding, non-rotated state: n = 307, 306, 302, 275, 213 and 137; colliding, rotated state: n = 307, 305, 293, 254, 187 and 117. Non-rotated and rotated state dwell times were used to calculate cumulative probability density functions of the observed data (ecdf, MATLAB), which were fitted to single-exponential functions in MATLAB (fit, using non-linear least squares methods). If initial double-exponential fitting of the data yielded a population representing less than 10% of events, the data were classified as following single-exponential kinetic behaviour74.
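    The ECDF construction and single-exponential fit can be sketched in Python (a stand-in for MATLAB's ecdf and fit; the dwell times here are simulated, and scipy's curve_fit replaces MATLAB's non-linear least squares):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_single_exponential(dwells):
    """Build the empirical CDF of dwell times and fit it to 1 - exp(-k*t)
    by non-linear least squares; returns the rate k."""
    t = np.sort(np.asarray(dwells, float))
    ecdf = np.arange(1, len(t) + 1) / len(t)
    (k,), _ = curve_fit(lambda x, k: 1.0 - np.exp(-k * x),
                        t, ecdf, p0=[1.0 / t.mean()])
    return k

# simulated exponential dwells with a true rate of 0.5 s^-1
rng = np.random.default_rng(0)
dwells = rng.exponential(scale=2.0, size=5000)
k = fit_single_exponential(dwells)
```

A double-exponential variant (sum of two such terms with weights) can be fitted the same way; per the criterion above, a minor population below 10% would send the data back to the single-exponential model.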

    (3)

      Two-colour equilibrium coupling experiments (Fig. 3) were acquired using ALEX. Traces were first selected for assembled expressomes (30S–Cy3 signal and direct excitation signal of RNAP–Cy5) using SPARTAN. Next, traces were baseline-corrected and corrected for the following: (a) donor emission bleedthrough, (b) direct excitation of the acceptor dye and (c) the apparent sensitivity of each fluorophore. Selected traces were exported to tMAVEN. The evaluation windows were selected to include only times where the RNAP–Cy5 (direct) signal persisted for at least 100 s. To model the number of FRET states, we used HMM (Global → vbConsensus + Model selection). The resulting number of states was used to assign FRET states using vbHMM + Kmeans. Selected evaluation windows and assignments were exported for downstream evaluation. FRET efficiencies were extracted with MATLAB using the evaluation windows and fitted to two or three Gaussian distributions (depending on the modelled number of states) using maximum likelihood parameter estimation in MATLAB. For dwell time analysis, HMM assignments based on tMAVEN were visually inspected in MATLAB and corrected for true anti-correlated behaviour of the donor and acceptor dyes. States with EFRET = 0 were assigned as uncoupled, and all states with EFRET > 0 (loosely coupled and coupled) were binned into a single coupled-expressome state. Recoupling rates were obtained by single-exponential fitting of cumulative dwell time distributions for dwells with EFRET = 0.
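    The maximum-likelihood Gaussian fitting of the FRET efficiency histogram can be sketched with a small expectation-maximization loop (a numpy stand-in for the MATLAB mle-based fit; the two-population parameters below are hypothetical, loosely standing in for uncoupled (E ≈ 0) and coupled (E > 0) states):

```python
import numpy as np

def fit_two_gaussians(x, n_iter=200):
    """Two-component Gaussian mixture fitted by expectation-maximization;
    returns (weight of component 1, mean1, sd1, mean2, sd2)."""
    x = np.asarray(x, float)
    w, m1, s1, m2, s2 = 0.5, x.min(), x.std(), x.max(), x.std()
    for _ in range(n_iter):
        # E step: responsibility of component 1 for each point
        p1 = w * np.exp(-0.5 * ((x - m1) / s1) ** 2) / s1
        p2 = (1 - w) * np.exp(-0.5 * ((x - m2) / s2) ** 2) / s2
        r = p1 / (p1 + p2)
        # M step: update weights, means and standard deviations
        w = r.mean()
        m1 = np.sum(r * x) / np.sum(r)
        m2 = np.sum((1 - r) * x) / np.sum(1 - r)
        s1 = np.sqrt(np.sum(r * (x - m1) ** 2) / np.sum(r))
        s2 = np.sqrt(np.sum((1 - r) * (x - m2) ** 2) / np.sum(1 - r))
    return w, m1, s1, m2, s2

# hypothetical mixture: uncoupled (E ~ 0) and coupled (E ~ 0.6) populations
rng = np.random.default_rng(1)
e = np.concatenate([rng.normal(0.0, 0.05, 3000), rng.normal(0.6, 0.08, 2000)])
w, m1, s1, m2, s2 = fit_two_gaussians(e)
```

A three-component fit follows the same pattern with one more responsibility term.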

    (4)

      For three-colour real-time coupling experiments (Fig. 4; ribosome stalled on RBS, RNAP chased with NTP), assembled expressome molecules were selected (30S–Cy3 signal, DNA–2×Cy3.5 signal and in the case of ALEX, direct excitation signal of RNAP–Cy5) and assigned by thresholding, due to spectral bleedthrough between channels (see Extended Data Fig. 7c,d). When applicable, dwell times for transcription or coupling were extracted. For determination of the fraction of coupled molecules until the end of transcription, we assigned the characteristic signal (30S–Cy3/DNA–2×Cy3.5 FRET) and calculated: fraction coupled = (number of coupled molecules with 30S–DNA FRET)/(total number of expressome molecules).

    (5)

      For two- and three-colour transcription data evaluation, only traces with single expressome molecules (containing 30S–Cy3 (if present) and DNA–2×Cy3.5) were used. Transcription times were evaluated by assigning the time from reagent delivery until DNA template dissociation using trace-specific thresholds (Extended Data Fig. 7c,d). Average transcription times (Extended Data Fig. 1g) were obtained by fitting dwell time distributions to a convolution of a Gaussian (describing transcription) and an exponential function (describing RNAP stalling at the 3′ end before termination)15.
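    The Gaussian–exponential convolution is the exponentially modified Gaussian distribution; a Python sketch with simulated transcription times (scipy's exponnorm, parametrized by K = τ/σ; all numerical values here are hypothetical, not the paper's):

```python
import numpy as np
from scipy.stats import exponnorm

# simulated data: Gaussian transcription transit (mean 60 s, s.d. 5 s)
# plus exponential 3'-end stalling before termination (mean 20 s)
rng = np.random.default_rng(2)
times = rng.normal(60, 5, 4000) + rng.exponential(20, 4000)

# fit the exponentially modified Gaussian; scipy uses K = tau / sigma
K, loc, scale = exponnorm.fit(times)
mu, sigma, tau = loc, scale, K * scale
mean_total = mu + tau  # average transcription time incl. terminal stalling
```

The fitted μ and σ describe the transcription step, τ the terminal stalling, and μ + τ the average overall dwell.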

    For representation, the single-molecule traces were smoothed by zero-phase digital filtering over 3 points using the filtfilt function in MATLAB, but unsmoothed data were used for data evaluation.
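    For reference, the same zero-phase smoothing can be reproduced in Python (scipy's filtfilt mirrors MATLAB's; a 3-point moving average applied forward and backward, used for display only):

```python
import numpy as np
from scipy.signal import filtfilt

def smooth_trace(intensity, n=3):
    """Zero-phase n-point moving-average smoothing: the FIR kernel is run
    forward and backward, so the filter introduces no phase lag."""
    b = np.ones(n) / n      # moving-average kernel
    return filtfilt(b, [1.0], intensity)

# hypothetical fluorescence trace (arbitrary units)
trace = np.array([0., 0., 1., 1., 0., 0., 1., 1., 1., 0., 0., 0.])
smoothed = smooth_trace(trace)
```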

    Structures were visualized with PyMOL (v.2.4) or ChimeraX (v.1.3). All figures were prepared with MATLAB R2021a, Excel 2016 and Adobe Illustrator.

    Statistical analysis

    Reported error bars represent the s.d. of replicates as indicated in the figure captions. Errors in dwell time fits represent 95% confidence intervals obtained from fits to single-exponential functions as indicated. Statistical details of individual experiments, including the number of replicates, analysed molecules or number of dwells used in the dwell time analyses, are described in the manuscript text, figure legends, supplementary information or the Source Data. Every single-molecule experiment was performed on a different sample with two to three biological replicates (see details on the number of replicates in the figure captions and more information in the Source Data file). P values were determined via two-sided Wilcoxon–Mann–Whitney tests in MATLAB and are reported as *P < 0.05, **P < 0.01 and ***P < 0.001. Values and test statistics can be found in the Source Data. In all box plots, the centre line is the median, box edges indicate the 25th and 75th percentiles and whiskers extend to 1.5× the interquartile range.
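    The significance testing can be sketched in Python (scipy's mannwhitneyu in place of MATLAB's ranksum; the two dwell-time samples below are simulated, not experimental data):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def star_label(p):
    """Map a P value to the significance labels used in the figures."""
    if p < 0.001:
        return '***'
    if p < 0.01:
        return '**'
    if p < 0.05:
        return '*'
    return 'n.s.'

# hypothetical dwell-time samples from two conditions
rng = np.random.default_rng(3)
a = rng.exponential(2.0, 200)   # e.g. elongating
b = rng.exponential(4.0, 200)   # e.g. colliding
stat, p = mannwhitneyu(a, b, alternative='two-sided')
label = star_label(p)
```

The rank-based test makes no normality assumption, which suits the strongly right-skewed dwell-time distributions.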

    Reporting summary

    Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


  • AI has dreamt up a blizzard of new proteins. Do any of them actually work?

    Download the 29 November long read podcast

    AI tools that help researchers design new proteins have resulted in a boom in designer molecules. However, these proteins are being churned out faster than they can be made and tested in labs.

    To overcome this, multiple protein-design competitions have popped up, with the aim of sifting out the functional from the fantastical. But while contests have helped drive key scientific advances in the past, it’s unclear how to identify which problems to tackle and how best to select winners objectively.

    This is an audio version of our Feature: AI has dreamt up a blizzard of new proteins. Do any of them actually work?

    Never miss an episode. Subscribe to the Nature Podcast on Apple Podcasts, Spotify, YouTube Music or your favourite podcast app. An RSS feed for Nature Podcast is available too.


  • AI protein-prediction tool AlphaFold3 is now open source

    [Image: protein structure of the 7PZA Clr-cAMP-DNA complex.] AlphaFold3 can predict the structures of proteins as they interact with DNA. Credit: Werel et al./American Society for Microbiology, Mol*, RCSB PDB

    AlphaFold3 is open at last. Six months after Google DeepMind controversially withheld code from a paper describing the protein-structure prediction model, scientists can now download the software code and use the artificial intelligence (AI) tool for non-commercial applications, the London-based company announced on 11 November.

    “We’re very excited to see what people do with this,” says John Jumper, who leads the AlphaFold team at DeepMind and last month, along with CEO Demis Hassabis, won a share of the 2024 Chemistry Nobel Prize for their work on the AI tool.

    AlphaFold3, unlike its predecessors, is capable of modelling proteins in concert with other molecules. But instead of releasing its underlying code — as was done with AlphaFold2 — DeepMind provided access via a web server that restricted the number and types of predictions scientists could make.

    Crucially, the AlphaFold3 server prevented scientists from predicting how proteins behave in the presence of potential drugs. But now, DeepMind’s decision to release the code means academic scientists can predict such interactions by running the model themselves.

    The company initially said that making AlphaFold3 available only through a web server struck the right balance between enabling access for research and protecting commercial ambitions. Isomorphic Labs, a DeepMind spinoff company in London, is applying AlphaFold3 to drug discovery.

    But the publication of AlphaFold3 without its code or model weights — parameters obtained by training the software on protein structures and other data — drew criticism from scientists, who said the move undermined reproducibility. DeepMind swiftly reversed course and said it would make an open-source version of the tool available within half a year.

    Anyone can now download the AlphaFold3 software code and use it non-commercially. But for now, only scientists with an academic affiliation can access the training weights on request.

    Accessible versions

    DeepMind has competition: over the past few months, several companies have unveiled open-source protein-structure prediction tools based on AlphaFold3, relying on specifications, known as pseudocode, described in the original paper.

    Two Chinese companies — technology giant Baidu and TikTok developer ByteDance — have rolled out their own AlphaFold3-inspired models, as has a start-up in San Francisco, California, called Chai Discovery.

    A key limitation of these models is that, like AlphaFold3, none is licensed for commercial applications such as drug discovery, says Mohammed AlQuraishi, a computational biologist at Columbia University in New York City. However, Chai Discovery’s model, Chai-1, can be used via a web server for such work, says Jack Dent, a co-founder of the company.

    Another firm, San Francisco-based Ligo Biosciences, has released a restriction-free version of AlphaFold3. But it doesn’t yet have the full suite of capabilities, including the capacity to model drugs and molecules other than proteins.

    Other teams are working on versions of AlphaFold3 that don’t come with such limits: AlQuraishi hopes to have a fully open-source model called OpenFold3 available by the end of the year. This would enable drug companies to retrain their own versions of the model using proprietary data, such as the structures of proteins bound to different drugs, potentially improving performance.

    Openness matters

    The last year has seen a flood of new biological AI models released by companies with varying approaches to openness. Anthony Gitter, a computational biologist at the University of Wisconsin-Madison, has no problem with for-profit companies joining his field — so long as they play by the same rules as other scientists when they share their work in journals and preprint servers.

    If DeepMind makes claims about AlphaFold3 in a scientific publication, “I and others expect them to also share information about how predictions were made and put the AI models and code out in a way that we can inspect,” Gitter adds. “My group’s not going to build on and use the tools that we can’t inspect.”

    The fact that several AlphaFold3 replications have already emerged shows that the model was reproducible, even without open-source code, says Pushmeet Kohli, DeepMind’s head of AI for science. He adds that in future he would like to see more discussion about publishing norms in a field increasingly populated by both academic and corporate researchers.

    The open-source nature of AlphaFold2 led to a flood of innovation from other scientists. For instance, the winners of a recent protein design contest used the AI tool to design new proteins capable of binding a cancer target. Jumper’s favourite recent AlphaFold2 hack was from a team that used the tool to identify a key protein that helps sperm attach to egg cells.

    Jumper can’t wait for such surprises to emerge after sharing AlphaFold3 — even if they don’t always bear fruit. “People will use it in weird ways,” he predicts. “Sometimes it will fail and sometimes it will succeed.”


  • Ab initio characterization of protein molecular dynamics with AI2BMD


    Protein fragmentation approach

    Generally speaking, proteins are composed of 20 kinds of amino acid, each of which has a common main chain consisting of Cα, C, O, N and H, and a different side chain (termed the R group). A dipeptide is an amino acid capped with Ace and Nme groups at its N and C termini, respectively. As amino acids are the fundamental units of proteins, we designed the generalizable protein fragmentation approach on the basis of dipeptides and trained AI2BMD potential accordingly, which ensures the generalization ability to all proteins.

    The concept of peptide fragmentation has been around for years, and previous studies have demonstrated its accuracy and efficiency for proteins10,42. As shown in Extended Data Fig. 7a, each dipeptide consists of: all atoms of the main chain and side chain of the amino acid; the Cα, the H connected to Cα, and the C and O of the main chain of the previous amino acid; and the N, the H connected to N, the Cα and the H connected to Cα of the main chain of the next amino acid. We cut the polypeptide chains with a sliding window, so the Ace-Nme fragments act as the overlapping regions between two successive dipeptides (Extended Data Fig. 7b). The extra hydrogens for the terminal Cα atoms were added to dipeptides and Ace-Nme fragments according to the C–H bond length and the direction of the bond connected to the Cα in the whole peptide chain. If the first or last amino acid was glycine, we added only one hydrogen connected to Cα, according to the C–H bond length. If the next amino acid was proline, we also added a hydrogen connected to N (the N bonded to Cδ), according to the N–H bond length. Then, the limited-memory Broyden–Fletcher–Goldfarb–Shanno quasi-Newton algorithm43 was applied to optimize the positions of the added hydrogens while the other parts were constrained.

    We first calculated the total energy and force for all protein units by summing the energies of the dipeptides and subtracting the energies of all overlapping Ace-Nme fragments (equation (1)).

    $${E}^{{\rm{prot}}\_{\rm{units}}}=\mathop{\sum }\limits_{i=1}^{n}{E}_{i}^{{\rm{dipeptide}}}-\mathop{\sum }\limits_{i=1}^{n-1}{E}_{i}^{{\rm{Ace}}\text{-}{\rm{Nme}}}$$

    (1)

    in which n is the number of amino acids or dipeptides.

    The force on each atom is obtained analogously, combining contributions from every dipeptide and Ace-Nme fragment that contains the atom (equation (2)).

    $${F}_{i}^{{\rm{prot}}\_{\rm{units}}}=\mathop{\sum }\limits_{j=1}^{m}{F}_{ij}^{{\rm{dipeptide}}}-\mathop{\sum }\limits_{j=1}^{n}{F}_{ij}^{{\rm{Ace}}\text{-}{\rm{Nme}}}$$

    (2)

    in which i denotes the atom for force calculation, m is the number of dipeptides that atom i belongs to, n is the number of Ace-Nme fragments that atom i belongs to, and Fij is the force on atom i computed within the jth such fragment.
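In code, the fragment bookkeeping of equations (1) and (2) reduces to summing dipeptide contributions and subtracting the overlapping Ace-Nme contributions. A minimal Python sketch — the data layout and function name are illustrative, not taken from the AI2BMD codebase:

```python
import numpy as np

def combine_fragments(dipeptide_results, acenme_results, n_atoms):
    """Combine per-fragment energies/forces into protein-unit totals (eqs (1)-(2)).

    Each entry is (energy, {atom_index: force_vector}), where atom indices
    refer to the full protein. Overlapping Ace-Nme contributions are
    subtracted so that double-counted atoms cancel out.
    """
    energy = 0.0
    forces = np.zeros((n_atoms, 3))
    for e, frag_forces in dipeptide_results:   # sum over n dipeptides
        energy += e
        for i, f in frag_forces.items():
            forces[i] += f
    for e, frag_forces in acenme_results:      # subtract n-1 overlap fragments
        energy -= e
        for i, f in frag_forces.items():
            forces[i] -= f
    return energy, forces
```

The same accumulate-then-subtract pattern applies regardless of how the per-fragment energies and forces are produced (here, by the ViSNet models described below).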

    We then accounted for the extra interactions between non-overlapping protein units. Supplementary Fig. 18 shows these extra interactions for the different protein units of a tetrapeptide; two sets of interactions are not captured by the fragment sums. Extended Data Fig. 8a illustrates the extra interactions between the group of CH1, C1, O1 and NH1 (outlined in purple) and the last part of the tetrapeptide (also outlined in purple). Extended Data Fig. 8b shows the extra interactions between the beginning part of the tetrapeptide, which includes CH30, C0, O0 and NH1 (outlined in brown), and the part extending from the second side chain to the C terminus (also outlined in brown).

    Considering that the interactions in such non-overlapped regions are dominated by electrostatic force and van der Waals interactions, we used the Coulomb equation and the Lennard-Jones potential to describe them. Then, we used the corresponding parameters derived from the Amber ff19SB force field20 and the distance between atoms to calculate the potential energy and atomic forces for the extra interactions (equations (3) and (4)). The selection of ff19SB was informed by its superior performance in evaluating the relative energy when compared to another widely used force field, CHARMM36 (ref. 44), as illustrated in Supplementary Fig. 11.

    $${E}^{{\rm{prot}}}={E}^{{\rm{prot}}\_{\rm{units}}}+\mathop{\sum }\limits_{\begin{array}{c}i=1\\ i\in A\end{array}}^{n-1}\mathop{\sum }\limits_{\begin{array}{c}j=i+1\\ j\notin A\end{array}}^{n}{E}_{ij}^{{\rm{Coulomb}}}+\mathop{\sum }\limits_{\begin{array}{c}i=1\\ i\in A\end{array}}^{n-1}\mathop{\sum }\limits_{\begin{array}{c}j=i+1\\ j\notin A\end{array}}^{n}{E}_{ij}^{{\rm{VDW}}}$$

    (3)

    $${F}_{i}^{{\rm{prot}}}={F}_{i}^{{\rm{prot}}\_{\rm{units}}}+\mathop{\sum }\limits_{\begin{array}{c}j=i+1\\ j\notin A\end{array}}^{n}{F}_{ij}^{{\rm{Coulomb}}}+\mathop{\sum }\limits_{\begin{array}{c}j=i+1\\ j\notin A\end{array}}^{n}{F}_{ij}^{{\rm{VDW}}}$$

    (4)

    in which the energy and force with the superscript ‘prot_units’ represent the values obtained from equations (1) and (2), and A denotes the atom set in the corresponding unit. To avoid double counting, the sums traverse only the atoms with indices after the current atom.

    Given the protein fragmentation approach, all proteins can be converted into 21 kinds of protein unit (that is, 20 kinds of dipeptide and another Ace-Nme), which substantially reduced the number of specific types of protein unit, facilitated dataset construction and model training, contributed to exploring the whole conformational space, avoided holes in the potential energy surface, and thus improved the generalization, efficiency and robustness of the MD simulation.

    Protein unit dataset

    The training dataset for the AI2BMD potential was generated through the following protocols. First, the ‘Sequence’ command in the tleap module of AmberTools20 (ref. 45) was used to generate the topology and coordinate files for the initial 20 kinds of dipeptide and for Ace-Nme. Then, ϕ (the dihedral C–N–Cα–C) and ψ (the dihedral N–Cα–C–N) were two-dimensionally scanned over the range −180° to 175° with an interval of 5°. For the proline dipeptide, the ϕ scan was restricted to −180° to 120° owing to its ring conformation. The dihedral rotations were performed with the ‘rotatedihedral’ command in CPPTRAJ46. For each non-proline dipeptide, 5,184 anchors were generated. For Ace-Nme, the same −180° to 175° scan with a 5° interval was applied to the axis between the C of Ace and the N of Nme, resulting in 72 anchors.
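The anchor count follows directly from the scan parameters: 360°/5° = 72 values per dihedral, hence 72 × 72 = 5,184 (ϕ, ψ) anchors per non-proline dipeptide and 72 anchors for the single Ace-Nme axis. A small illustrative sketch of the grid construction:

```python
import itertools

def dihedral_anchors(start=-180, stop=175, step=5):
    """2D (phi, psi) scan grid used to seed the geometry optimizations."""
    vals = list(range(start, stop + 1, step))  # -180, -175, ..., 175 (72 values)
    return list(itertools.product(vals, vals))

anchors = dihedral_anchors()
# 72 values per dihedral -> 72 * 72 = 5,184 anchors per non-proline dipeptide
```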

    Each anchor first underwent a geometry optimization (‘GO’) process to obtain a reasonable structure. The solvation model density (SMD) solvent model was used during GO, and the ϕ and ψ dihedrals were constrained to the same values used for anchor generation. For each anchor, the last structure of the GO process was used as the input structure for AIMD simulations. SMD was used to sample conformations while taking the solvent effect into consideration at the QM level. For dipeptides, 225-fs simulations were run for each anchor, and the structures from the last 200 fs were extracted. Simulations of 2,025 fs were carried out for Ace-Nme, and the last 2,000 fs were extracted from each trajectory. Because an explicit solvent was used during MD simulations driven by the AI2BMD potential, after the AIMD simulations we recalculated the single-point energy and forces for all extracted conformations without SMD; these values were used for MLFF training.

    During GO, AIMD simulations and single-point energy calculations, DFT with the M06-2X density functional and the 6-31g* basis set was used47. This basis set and functional are generally suitable for biomolecular sampling14,15,48,49. We set tight convergence conditions, and convergence was mandatory before the next calculation step. Systems were coupled to a canonical sampling through velocity rescaling thermostat at 290 K (ref. 50). These simulations were performed with ORCA 5.0.1 (ref. 51). The charge of each system was set to the summed charge of its amino acids at pH 7. The GO, AIMD simulations and single-point energy calculations took about 12,928,993 central processing unit (CPU) core hours (1,476 CPU core years). As a result, 1,036,800 conformations were sampled and calculated at the DFT level for each kind of dipeptide, and 144,000 conformations were sampled for Ace-Nme. The distributions of energy and of the norm of the force for each kind of protein unit are shown in Supplementary Tables 4 and 5 and Supplementary Figs. 12 and 13. The whole protein unit dataset consists of about 20 million conformations that comprehensively capture the conformational space of the protein units and provide a solid foundation for machine learning potential training and AI2BMD simulation.

    ViSNet as AI2BMD potential

    ViSNet is a versatile geometric deep learning model7,52 that can predict potential energy and atomic forces, as well as various quantum chemical properties, by taking atomic coordinates and atomic numbers as inputs. As shown in Supplementary Fig. 2a, the ViSNet model is composed of an embedding block and multiple stacked ViSNet blocks, followed by an output block. The atomic number and coordinates are fed into the embedding block followed by ViSNet blocks to extract and encode geometric representations. The geometric representations are then used to predict molecular energy and force through the output block. Supplementary Fig. 2b demonstrates the ViSNet block, which consists of a message block and an update block. These blocks work together as parts of a vector scalar interactive message-passing mechanism, referred to as ViS-MP. The rich geometric information passed via ViS-MP is extracted by the runtime geometric calculation module with linear complexity. The operations in Supplementary Fig. 2b can be summarized as follows:

    $${m}_{i}^{l}=\sum _{j\in {\mathscr{N}}(i)}{\phi }_{m}^{s}({h}_{i}^{l},{h}_{j}^{l},{f}_{{ij}}^{l})$$

    (5)

    $${{\bf{m}}}_{i}^{l}=\sum _{j\in {\mathscr{N}}(i)}{\phi }_{m}^{v}({m}_{ij}^{l},{{\bf{r}}}_{ij},{{\bf{v}}}_{j}^{l})$$

    (6)

    $${h}_{i}^{l+1}={\phi }_{un}^{s}({h}_{i}^{l},{m}_{i}^{l},\langle {{\bf{v}}}_{i}^{l},{{\bf{v}}}_{i}^{l}\rangle )$$

    (7)

    $${f}_{ij}^{l+1}={\phi }_{ue}^{s}\,({f}_{ij}^{l},\langle {\text{Rej}}_{{{\bf{r}}}_{ij}}\,({{\bf{v}}}_{i}^{l})\,,{\text{Rej}}_{{{\bf{r}}}_{ji}}({{\bf{v}}}_{j}^{l})\rangle )$$

    (8)

    $${{\bf{v}}}_{i}^{l+1}={\phi }_{un}^{v}({{\bf{v}}}_{i}^{l},{m}_{i}^{l},{{\bf{m}}}_{i}^{l})$$

    (9)

    in which \({h}_{i}^{l}\) represents the scalar feature of node \(i\) in the lth layer, \({{\bf{v}}}_{i}^{l}\) represents the vectorized node feature and \({f}_{{ij}}\) represents the scalar edge feature between node \(i\) and node \(j\). \({\phi }_{m}^{s}\), \({\phi }_{m}^{v}\) are nonlinear message functions to transform messages from neighbours and \({\phi }_{{un}}^{s}\), \({\phi }_{{ue}}^{s}\), \({\phi }_{{un}}^{v}\) are nonlinear update functions to update the corresponding feature according to the message and geometric features. More details about runtime geometric calculation and ViS-MP can be found in ref. 7.

    For each kind of protein unit, ViSNet was trained as an energy-conserving potential model; that is, the predicted atomic forces were derived from the negative gradients of the potential energy with respect to the atomic coordinates. We randomly split each protein unit dataset into a training set, a validation set and a test set with a ratio of 8:1:1. Hyperparameters were tuned on the validation set of the alanine dipeptide and applied directly to the other protein units. Concretely, all ViSNet models trained for protein units were relatively light, with only 6 hidden layers and 128 embedding dimensions for node and edge representations. To better capture geometric information, we expanded the raw three-dimensional coordinates of molecules using higher-order spherical harmonics53. The cutoff of the edge connection was set to 5 Å for all protein units, and the maximum number of neighbours for each atom was 32. We used a combined mean squared error loss for energy and force training, with weights of 0.05 and 0.95, respectively. We adopted a learning rate of 2 × 10−4 with 1,000 warm-up steps54 using the AdamW optimizer55. The learning rate decayed when the validation loss stopped decreasing; the patience was set to 15 epochs and the decay factor to 0.8. We also adopted an early-stopping strategy to prevent over-fitting56: the maximum number of epochs was set to 6,000, and the early-stopping patience was 150 epochs. All models were trained on a GPU cluster with 16 NVIDIA 32G-V100 GPUs per cluster node, and the batch size was 64 or 128 per GPU depending on the size of the protein unit. To help the model converge, for the training set we subtracted the sum of atomic reference energies from the total energy and then normalized the values with Z-score normalization. More details on the hyperparameters of ViSNet can be found in Supplementary Table 6.
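The training objective described above is a weighted sum of energy and force mean squared errors, with weights 0.05 and 0.95. A minimal NumPy sketch of that loss — the actual training code operates on model outputs and autograd-derived forces in a deep learning framework:

```python
import numpy as np

def energy_force_loss(pred_e, true_e, pred_f, true_f, w_e=0.05, w_f=0.95):
    """Combined MSE loss over energies and forces, with the 0.05/0.95
    weighting used for ViSNet protein-unit training."""
    e_term = np.mean((np.asarray(pred_e) - np.asarray(true_e)) ** 2)
    f_term = np.mean((np.asarray(pred_f) - np.asarray(true_f)) ** 2)
    return w_e * e_term + w_f * f_term
```

The heavy force weight reflects that forces, not energies, drive the dynamics in the downstream MD simulation.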

    AI2BMD simulation program

    To carry out simulations with the AI2BMD potential, we designed an AI-driven MD simulation program based on the atomic simulation environment57. Extended Data Fig. 9 shows an overview block diagram of the program. On program start, the initial protein structure is fed into the preprocessing module, where the solvent and ions are added and the structure is relaxed. The entire simulation system is then sent into the MD loop, the main logic component. In each iteration of the MD loop, the protein is first decomposed into fragments by the protein fragmentation module and then partitioned by the work scheduler. The partitioning scheme is dictated by a tunable device strategy: depending on the size of the simulation system, the user can instruct the work scheduler to maximize the utilization of all the GPUs by oversubscribing, or to reduce the memory pressure on a particular device by balancing the computation of different fragments across the GPU cards. The partitioned fragments and the solvent atoms are then asynchronously sent to different computation servers running in separate processes. This asynchronous client–server paradigm alleviates a substantial limitation of the Python runtime: only a single thread can execute Python code at a time in the same process. After the workload is distributed from the main component to the computation servers, it is processed in parallel, and the main Python process can immediately resume other tasks, such as persisting trajectory data, without being blocked by the servers.

    Considering that cloud computing is a popular and cost-efficient way to support scientific computing workloads, we designed the simulation process to be cloud-oriented. The software configuration is fully defined with a Docker image and remains invariant across different machines, which allows us to not only effortlessly deploy the software system to the cloud, but also fine-tune the program against a fixed set of supporting libraries. As cloud-based machines may be pre-empted, and the machine-local storage is volatile during a long-time simulation, we implemented a job scheduling component that periodically persists the computation results to cloud-based storage and resumes the simulation.

    System configuration in simulation

    We prepared the biomolecular systems using the Amber20 package with the AMOEBA 13 force field16. The protein was first solvated in a cubic TIP3P58 water box and then relaxed in energy minimization cycles. Then, Na+ and Cl− counterions and a 0.15 mol l−1 NaCl buffer were added. We used classical Amber Coulombic-potential-based methods to add the ions: a grid with a 1-Å bin size was generated, the Coulombic potential was calculated at every grid point, and each ion was placed at the grid point where the Coulombic potential of opposite sign to the ion was largest. If an ion had a steric conflict with a solvent molecule, the ion was moved to the centre of that solvent molecule, and the latter was removed.

    We adopted a hybrid calculation strategy for the simulation system; that is, the protein was treated with the AI2BMD potential at ab initio accuracy, whereas the AMOEBA 13 force field was used for the solvent. The total energy of the system (\({E}^{{\rm{total}}}\)) is computed as the sum of the deep learning (DL) energy calculated by ViSNet for the protein (\({E}_{{\rm{DL}}}^{{\rm{prot}}}\)) and the energy from the MM calculation for the entire system (\({E}_{{\rm{MM}}}^{{\rm{total}}}\)). Then, to avoid double counting the energy contribution from the protein, the MM energy of the protein atom interactions (\({E}_{{\rm{MM}}}^{{\rm{prot}}}\)) is subtracted, as shown in the following equation. These calculations are based on the classical integrated molecular orbital + MM model implemented in the atomic simulation environment package59,60.

    $${E}^{{\rm{total}}}={E}_{{\rm{DL}}}^{{\rm{prot}}}+{E}_{{\rm{MM}}}^{{\rm{total}}}-{E}_{{\rm{MM}}}^{{\rm{prot}}}$$

    (10)
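Equation (10) is the standard subtractive scheme: a high-level (ML) term for the region of interest, plus a low-level (MM) term for everything, minus the low-level term for the region of interest so that it is not counted twice. As a deliberately small sketch (function name illustrative):

```python
def hybrid_energy(e_dl_prot, e_mm_total, e_mm_prot):
    """Subtractive ML/MM combination (equation (10)):
    ViSNet energy for the protein + MM energy for the whole system
    - MM energy of the protein, to avoid double counting."""
    return e_dl_prot + e_mm_total - e_mm_prot
```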

    Similarly, the force \({F}_{i}^{{\rm{total}}}\) on atom i is initially set to the forces from the interactions between atom i and all other atoms in the protein (\({F}_{i}^{{\rm{prot}}}\)), as given by equation (4). To account for the solvent effect, an additional force is calculated between atom \(i\) and the other atoms in the system using the AMOEBA force field (the second term in equation (11)), and the solute terms are then subtracted (the third term in equation (11)).

    $${F}_{i}^{{\rm{total}}}={F}_{i}^{{\rm{prot}}}\,+\,\mathop{\sum }\limits_{\begin{array}{c}j\ne i\\ j\in B\end{array}}^{n}{F}_{ij}-\,\mathop{\sum }\limits_{\begin{array}{c}j\ne i\\ j\in {\rm{C}}\end{array}}^{n}{F}_{ij}$$

    (11)

    in which B represents all atoms in the entire system, and \(C\) represents the atoms in the solute. Furthermore, to calculate \({E}_{{\rm{DL}}}^{{\rm{prot}}}\) and \({F}_{{\rm{DL}}}^{{\rm{prot}}}\), we first split the protein into protein units, calculated the potential energies and atomic forces by ViSNet models and then combined all protein units by equation (3). More details about the protein fragmentation and ViSNet potential calculation can be found in the above sections. A simulation carried out under an NVE ensemble demonstrates the conserved total energy and the counterbalanced forces, thereby further substantiating the validity of subsequent sampling procedures (Supplementary Figs. 14 and 15). Furthermore, we also carried out the simulation for the same arginine dipeptide under an NVT ensemble and calculated the heat capacity by the following equation:

    $${C}_{{\rm{V}}}={\left(\frac{\partial U}{\partial T}\right)}_{{\rm{V}}}=\frac{\langle {E}^{2}\rangle -{\langle E\rangle }^{2}}{{k}_{{\rm{B}}}{\langle T\rangle }^{2}}$$

    (12)

    in which \(\langle {E}^{2}\rangle \) denotes the ensemble average of the square of the system energy and \({\langle E\rangle }^{2}\) denotes the square of the ensemble average of the system energy. The heat capacity values obtained with MM and AI2BMD are 0.052 kcal mol−1 K−1 and 0.053 kcal mol−1 K−1, respectively, which are mutually consistent and comparable to previous experimental values. The subsequent simulations in this study were run in Berendsen NVT ensembles with initial velocities randomly drawn from a Maxwell–Boltzmann distribution. The time step was set to 1 fs. During the simulation, the trajectory was written to a high-precision XYZ file.
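As a sketch of equation (12), the heat capacity can be estimated from the energy fluctuations along an NVT trajectory. A minimal version in Python (the Boltzmann constant is given in kcal mol−1 K−1 to match the units above; names are illustrative):

```python
import numpy as np

K_B = 0.0019872041  # Boltzmann constant in kcal mol^-1 K^-1

def heat_capacity(energies, temperatures):
    """C_V from NVT energy fluctuations (equation (12)):
    (<E^2> - <E>^2) / (k_B <T>^2)."""
    e = np.asarray(energies, dtype=float)
    t_mean = np.mean(temperatures)
    return (np.mean(e ** 2) - np.mean(e) ** 2) / (K_B * t_mean ** 2)
```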

    Simulation details

    In the evaluation of protein energy and force calculation, protein structures (Protein Data Bank (PDB) IDs: chignolin, 5AWL; Trp-cage, 2JOF; WW domain, 2F21; albumin-binding domain, 1PRB; PACSIN3, 6F55; SSO0941, 5VFK; APC, 5IZA; polyphosphate kinase, 1XDO; aminopeptidase N, 4XN9) were solvated in a generalized Born implicit solvent model. The alteration to the WW domain follows the GTT mutation in the previous study4, and the first five flexible residues of the albumin-binding domain were removed. The Amber program makeCHIR_RST was used to create chiral restraint files during replica-exchange molecular dynamics (REMD) simulation to preserve chiral properties at high temperatures. After 1,000 steps of minimization, equilibration runs of 200 ps were conducted at temperatures ranging from 300 K to 1,000 K with a stride of 100 K. The final equilibrated structures were used for REMD simulations at the corresponding temperatures. During the simulation, each replica ran for 2 ps before exchanging with neighbouring temperatures, and 5,000 exchanges occurred in each production run. REMD trajectories were divided into three states according to the Cα RMSD against the crystal structure. Specifically, for chignolin, folded structures have an RMSD of 0–2.5 Å, intermediate structures have an RMSD of 2.5–7.5 Å, and unfolded structures have an RMSD of more than 7.5 Å. For the other proteins, the ranges of the three states are 0–5 Å, 5–15 Å and >15 Å. The folded and unfolded states were each further divided into 5 clusters, and the intermediate structures were divided into 10 clusters, via the CPPTRAJ ‘cluster’ program. We picked the structure at each cluster centre, accumulating 20 initial structures in total. Finally, each initial structure was solvated in a 5-Å TIP3P water box and subjected to 10 steps of 1-fs AI2BMD simulation. Simulations were carried out under an NVT ensemble. The simulation temperature (300 K) was controlled by a Berendsen thermostat with τ = 10 fs. The reference energy and force of the corresponding structures were calculated at the M06-2X/6-31g* level. MM energy and force were calculated with the ff19SB force field.

    For sampling on Ace-N-Nme, we constructed the system using the ‘sequence’ command in tleap, and then applied a 10-ns REMD simulation, identical to the one used for protein sampling. From this, we extracted 50 representative structures using the CPPTRAJ ‘cluster’ program. We then conducted 10-ps simulations for each initial structure using AI2BMD with AMOEBA polarizable embedding, resulting in a cumulative sampling time of 500 ps. We also implemented 10-ps simulations using QM–MM with AMOEBA polarizable embedding and MM with Amber ff19SB on these conformations. Each simulation incorporated a water box of 5 Å. We then examined each snapshot during the simulations to locate any water molecules that formed a hydrogen bond with the main chain or side chain (criteria: frequency >90%, donor atom distance <3.5 Å, O–H–O angle >150°). Subsequently, we delineated the distribution of donor atom distances. Following the formation of one hydrogen bond, we isolated the water and Ace-N-Nme molecules and incrementally pulled the water from 2.5 Å to 4.0 Å to form 150 structures. Finally, we carried out single-point energy evaluation on the system of the water molecule and the dipeptide by QM at the M06-2X–6-31g* level, AI2BMD with AMOEBA solvent and MM with ff19SB.

    For AI2BMD simulation on dipeptides, we first generated the conformations of the dipeptides through the ‘sequence’ command in tleap. Then, the dipeptides were solvated in a 5-Å TIP3P water box, and we ran two repetitive 500-ns classical MD simulations under the ff19SB force field for sufficient sampling. k-means clustering was then applied, and 50 representative structures were picked up. Starting from the representative structures, we carried out 10-ns AI2BMD simulation for the negatively charged protein unit Ace-E-Nme, the positively charged Ace-R-Nme, Ace-F-Nme with a benzene ring in the side chain and Ace-S-Nme with a smaller side chain solvated by a 10-Å water box under an NVT ensemble. Furthermore, for J coupling analysis, 2 independent runs of 1-ns AI2BMD simulations were used, and 10,000 snapshots were saved. Then, ϕ values were estimated from each snapshot. The 3J(HN, Hα) coupling value was calculated through equation (13).

    $$J=7.09\,{\cos }^{2}(\phi -60^{\circ })-1.42\,\cos (\phi -60^{\circ })+1.55$$

    (13)
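Equation (13) is a Karplus-type relation mapping the backbone dihedral ϕ to the scalar coupling. A direct transcription in Python (note the 60° phase offset applied to ϕ before taking the cosine):

```python
import math

def j_hn_ha(phi_deg):
    """3J(HN, Halpha) coupling from the Karplus-type relation in equation (13).

    phi_deg: backbone dihedral phi in degrees.
    """
    theta = math.radians(phi_deg - 60.0)
    return 7.09 * math.cos(theta) ** 2 - 1.42 * math.cos(theta) + 1.55
```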

    For AI2BMD simulation on chignolin, we first aligned the structures in a 106-μs comprehensively sampled trajectory to the initial structure. Then, time-lagged independent component analysis was used on raw atom coordinates for projecting the free-energy landscape to a six-dimensional super surface61. On the basis of minibatch k-means algorithm, we clustered all conformations and then picked up 60 folded and unfolded structures as the representative structures. Then for each structure, we ran 10-ns AI2BMD and 10-ns MM simulations.

    In the Ramachandran plot, ϕ is the dihedral angle determined by Cn−1, Nn, Cαn and Cn, and ψ is the dihedral angle determined by Nn, Cαn, Cn and Nn+1. The subscript represents the index of a residue in a protein. The energy was estimated according to a Boltzmann distribution based on the density of points in each bin. This estimation was carried out using the potential of mean force. ϕ and ψ were set as two reaction coordinates (x, y). The potential of mean force values were calculated using equation (14).

    $$\Delta G(x,y)=-{k}_{{\rm{B}}}T\,{\rm{ln}}\,g(x,y)$$

    (14)

    in which \({k}_{{\rm{B}}}\) represents the Boltzmann constant, T is the temperature of systems (300 K) and \(g(x,{y})\) represents the normalized joint probability distribution. The free-energy value presented in the plot represents a relative energy value, computed by deducting the minimum free-energy value from the observed value. The Q score was calculated through equation (15) (ref. 23).

    $$Q=\frac{1}{N}\sum _{(i,\,j)}\frac{1}{1+\exp [5({r}_{{ij}}(X)-1.8\,{r}_{{ij}}^{0})]}$$

    (15)

    Native contacts were defined as any pair of heavy atoms from two residues separated by at least three residues whose distance is smaller than 4.5 Å in the native conformation. Equation (15) sums over the N pairs of native contacts in the crystal structure; \({r}_{{ij}}^{0}\) is the distance between heavy atoms i and j of a native contact in the crystal structure, and \({r}_{{ij}}(X)\) is the distance between atoms i and j in conformation X. The thresholds of Q values for folded and unfolded structures were set to >0.82 and <0.03, respectively23.
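Given the native-contact pair distances, equation (15) is a smoothed fraction of contacts: each pair contributes a sigmoid that is near 1 when the pair is at (or below) its native separation and near 0 when it is broken. A minimal sketch with the β = 5 Å−1 steepness and λ = 1.8 tolerance from the equation (names illustrative):

```python
import numpy as np

def q_score(r, r0, beta=5.0, lam=1.8):
    """Fraction of native contacts (equation (15)).

    r  : distances (Å) between native-contact atom pairs in conformation X
    r0 : the same pair distances in the crystal structure
    """
    r, r0 = np.asarray(r, dtype=float), np.asarray(r0, dtype=float)
    return float(np.mean(1.0 / (1.0 + np.exp(beta * (r - lam * r0)))))
```

A conformation with all pairs at their native distances scores just under 1; a fully extended chain scores near 0, matching the folded (>0.82) and unfolded (<0.03) thresholds quoted above.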

    For free-energy estimation for fast-folding proteins, we first evenly sampled 100,000 points in the simulation trajectories of ref. 4. Folded and unfolded states were classified by the same thresholds of Q values in the previous study23. Structures in the folded state were clustered into 10 clusters. The RMSD values were calculated on the basis of Cα coordinates according to the ‘rmsd’ method in MDTraj. ΔG, the free energy for the folding process, was calculated according to the ratio between the folded and unfolded structures. Using the re-evaluated energy for each conformation, we determined the folding enthalpy and the heat capacity change for protein folding. The melting temperature was extrapolated from the calculated ΔG, folding enthalpy and the heat capacity change.

    For the calculation of changes in enthalpy and heat capacity during protein folding and unfolding, 110-residue barnase (PDB: 1A2P) and 84-residue CI2 (PDB: 2CI2) were selected for evaluation, with enthalpy and heat capacity values measured by differential scanning calorimetry and spectroscopy36,37. For each protein, in addition to the folded structure derived from the PDB, 20 unfolded structures were generated for simulation. Following the previous study35, each conformation was explicitly solvated in a 10-Å water box. For barnase, 20 parallel simulations starting from the folded structure and 20 simulations starting from the unfolded structures were performed with GROMACS 2018 and the CHARMM36 force field at pH 4.1 and at temperatures of 295 K, 315 K and 335 K. The same settings were applied for CI2, except that the simulations were carried out at pH 6.3 and at temperatures of 335 K, 350 K and 365 K. Each system configuration was simulated for 2 ns under an NPT ensemble. Potential energy values of conformations sampled from the simulations were calculated by AI2BMD. The enthalpy change following thermal unfolding (ΔH) was calculated as the difference between the averaged enthalpy of the unfolded ensemble and that of the folded ensemble. We then conducted linear regression to determine the change in heat capacity (ΔCp) from the slope, as well as the enthalpy change at the melting temperature. Additionally, we estimated the folding free energy using the Gibbs−Helmholtz equation.

    In the pKa determination using thermodynamic integration, we initially reweighted all data points in the simulation trajectories provided by The Amber Project (https://ambermd.org/tutorials/advanced/tutorial6/index.php). Subsequently, we focused on the converged sections of the trajectories, selecting 2,500 data points per window for the dipeptide and 500 data points per window for thioredoxin to calculate the mean energy values. ΔG was computed using the integral:

    $$\Delta G=\int \left\langle \frac{\partial U}{\partial \lambda }\right\rangle {\rm{d}}\lambda \approx \sum _{\lambda }{w}_{\lambda }{\left\langle \frac{\partial U}{\partial \lambda }\right\rangle }_{\lambda }$$

    (16)

    in which wλ represents the window width, U denotes the internal energy, and λ specifies the sampling window. This approach encapsulates the free-energy variation across different protonation states, facilitating the accurate computation of the pKa value.
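The discretized form of equation (16) is simply a width-weighted sum of the window-averaged ∂U/∂λ values. A short sketch (names illustrative; in practice the per-window averages come from the converged trajectory sections described above):

```python
def ti_free_energy(window_widths, dudl_means):
    """Discretized thermodynamic integration (equation (16)):
    Delta G ~= sum over lambda windows of w_lambda * <dU/dlambda>_lambda.

    window_widths : width w_lambda of each sampling window
    dudl_means    : mean dU/dlambda within each window
    """
    return sum(w * d for w, d in zip(window_widths, dudl_means))
```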

    Reporting summary

    Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


  • Engineered receptors show how humans tell countless odour molecules apart


    Nature, Published online: 30 October 2024; doi:10.1038/d41586-024-03396-0

    How do odorant receptors in the human nose recognize a wide variety of scent molecules? The structures of engineered versions of these receptors finally provide much-needed answers to this fundamental question.


  • Hybrid protein filaments are a surprise twist in neurodegeneration


    Nature, Published online: 02 October 2024; doi:10.1038/d41586-024-03054-5

    Abnormal filaments of a single type of protein are hallmarks of neurodegeneration. Structural studies reveal filaments made from two discrete but interwoven proteins, giving clues about the origin of neurodegenerative conditions.


  • enzyme structures offer clues to microtubule control


    • RESEARCH BRIEFINGS

    Cellular structural filaments called microtubules are made up of units of tubulin proteins. Various enzymes add modifications to and remove them from tubulin units; together, these modifications form a ‘tubulin code’ that controls microtubule function. Structures of an enzyme called CCP5 reveal how it deforms tubulin tails to recognize and remove modifications called single-glutamate branches.


  • Cage-like complexes that protect folding proteins visualized in cells


    • RESEARCH BRIEFINGS

    The bacterial chaperonin complex — consisting of the proteins GroEL and GroES — assists the folding of newly synthesized proteins by transiently encapsulating them in a nanometre-scale cage. Visualizing this process using cryo-electron tomography in the intact cellular environment provides insights into the chaperonin reaction cycle in vivo.

  • How a bacterial immune system gets taken apart

    • RESEARCH BRIEFINGS

    Investigating small protein inhibitors of CRISPR–Cas — an adaptive immune system in bacteria — has led to the discovery of a mechanism for inhibiting a large macromolecular complex. AcrIF25, an anti-CRISPR protein, blocks CRISPR–Cas activity by sequentially extracting all six Cas7 subunits from this complex.

  • The biology of smell is a mystery — AI is helping to solve it

    The smell in the laboratory was new. It was, in the language of the business, tenacious: for more than a week, the odour clung to the paper on which it had been blotted.

    To researcher Alex Wiltschko, it was the smell of summertime in Texas: watermelon, but more precisely, the boundary where the red flesh transitions into white rind.

    “It was a molecule that nobody had ever seen before,” says Wiltschko, who runs a company called Osmo, based in Cambridge, Massachusetts. His team created the compound, called 533, as part of its mission to understand and digitize smell. His goal — to develop a system that can detect, predict or create odours — is a tall order, as molecule 533 shows. “If you looked at the structure, you would never have guessed that it smelled this way.”

    That’s one of the problems with understanding smell: the chemical structure of a molecule tells you almost nothing about its odour. Two chemicals with very similar structures can smell wildly different; and two wildly different chemical structures can produce an almost identical odour. And most smells — coffee, Camembert, ripe tomatoes — are mixtures of many tens or hundreds of aroma molecules, intensifying the challenge of understanding how chemistry gives rise to olfactory experience.

    Another problem is working out how smells relate to each other. With vision, the spectrum is a simple colour palette: red, green, blue and all their swirling intermediates. Sounds have a frequency and a volume, but for smell there are no obvious parameters. Where does an odour identifiable as ‘frost’ sit in relation to ‘sauna’? It’s a real challenge to make predictions about smell, says Joel Mainland, a neuroscientist at the Monell Chemical Senses Center, an independent research institute in Philadelphia, Pennsylvania.

    Animals, including humans, have evolved a remarkably complex decoding system befitting the enormous repertoire of odour molecules. All sensory information is processed by receptors, and odour is no different — except in its scale. For light, the human eye has two types of receptor cell; for smell, there are 400. How the signals from these receptors combine to trigger a particular perception is unclear. Plus, the receptor proteins themselves are hard to work with, so what they look like and how they function have mostly been guesswork.

    Things are beginning to change, however, thanks to improvements in structural biology, data analytics and artificial intelligence (AI). Many scientists hope that cracking the olfactory code will help them to understand how animals use this essential sense to find food or mates, and how it feeds into memory, emotion, stress, appetite and more.

    Others are trying to digitize smell to build new technologies: devices that diagnose disease on the basis of odours; better, safer insect repellents; and affordable or more-effective aroma molecules for the US$30-billion flavour and fragrance market. At least 20 start-up firms are trying to make electronic noses for applications in health and public safety.

    This all adds up to a surge of research into the biology of olfaction, says Sandeep Robert Datta, a neuroscientist at Harvard Medical School in Boston, Massachusetts. “Smell is having a moment,” he says.

    Smelling machines

    Even for experts, the physical properties of an odour molecule typically offer little insight into how it will actually smell.

    Researchers have come up with a few computational models that can relate structure to odour, but early versions tended to be based on quite narrow data sets or could only make predictions when smells had been calibrated to have the same perceived intensity. In 2020, one team reported a model that could predict how similar real-world mixtures were to each other, correctly identifying that rose and violet odorants are more similar to one another than either is to the pungent spice asafoetida, often used in Indian cuisine1.

    Previous attempts to use machine learning were good, but not great. For example, when researchers ran a competition to create the best odour-predicting model, algorithms from 22 teams could effectively predict only 8 out of 19 smell descriptors2.

    Last year, Wiltschko’s team — then part of Google’s AI research division — collaborated with researchers at Monell, including Mainland, to publish a map for smell3 that made use of AI.

    Their program was trained by feeding the model thousands of descriptions of molecular structures from fragrance catalogues, along with smell labels for each — terms such as ‘beefy’ or ‘floral’.

    Then, the researchers compared the AI system with human noses. They trained 15 panellists to rate a few hundred aromas using 55 labels, such as ‘smoky’, ‘tropical’ and ‘waxy’.

    Humans have a hard time with this task because smell is so subjective. “There’s no universal truth,” says Mainland. Most smell descriptions lack detail, too. For one smell, panellists chose the words ‘sharp, sweet, roasted, buttery’. A master perfumer, asked to describe the same smell, noted ‘ski lodge, fireplace without a fire’. “That shows you the gap,” says Mainland. “Our lexicon is not good enough.” Nonetheless, a human panel is one of the best available tools for coming up with consistent smell descriptors because the average rankings of the group for different smells tend to be stable.

    Using the structure of these molecules alone, the AI algorithm did well at predicting the smell of compounds compared with the average group assessments (see ‘Same but different’), and it performed better than the typical individual sniffer. And although the map it produced was very complicated — it has more than 250 dimensions — it was able to group smells by type, such as meaty, alcoholic or woody.
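    The published model is a graph neural network trained on structure–label pairs; as a purely illustrative stand-in, the sketch below shows the underlying idea of multilabel smell prediction in the simplest possible form, pooling the descriptors of structurally similar known molecules. The fingerprints and labels here are invented for the example, not taken from the study.

```python
def tanimoto(a, b):
    """Similarity between two binary substructure fingerprints."""
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 0.0

def predict_labels(fingerprint, library, k=2):
    """Nearest-neighbour multilabel prediction: pool the smell
    descriptors of the k most similar known molecules."""
    ranked = sorted(library, key=lambda m: tanimoto(fingerprint, m["fp"]),
                    reverse=True)
    labels = set()
    for m in ranked[:k]:
        labels |= m["labels"]
    return labels

# Toy library: hypothetical fingerprints and descriptors
library = [
    {"fp": [1, 1, 0, 0], "labels": {"fruity", "sweet"}},
    {"fp": [1, 0, 1, 0], "labels": {"floral"}},
    {"fp": [0, 0, 1, 1], "labels": {"smoky"}},
]
print(predict_labels([1, 1, 1, 0], library, k=2))
```

    A real model learns a continuous embedding rather than counting shared substructure bits, which is what lets it place chemically dissimilar molecules near each other when they smell alike.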

    Same but different: graphic that shows how structurally similar molecules can smell different, and structurally different ones can smell similar.

    Source: Ref. 3

    Mainland says that the algorithm’s thoroughness helped it to perform. Humans might rate an odour as fruity but forget to rate it as sweet. The model, exhaustive and patient, churns through all possibilities each time.

    One challenge that Mainland and the Osmo team are now tackling is whether the model can predict what mixtures of compounds smell like, on the basis of their components. Another goal is to have the model design new odours, for instance chemicals that mimic a specific scent, or that are safer, more sustainable or biodegradable.

    AI probably can’t do this alone, says Jane Parker, a flavour chemist at the University of Reading, UK, who helped the smell-mapping team with quality control of their compounds. “The model could give you an idea of what might work,” she says. But the expertise and ingenuity of human chemists and flavourists — plus their highly trained noses — will still be necessary for innovation.

    Mysterious code

    For both expert and amateur sniffers, the biological equipment for smell detection is the same. The nose has millions of olfactory neurons, and each typically expresses just one type of odorant receptor (OR). The gene family that encodes them was discovered4 in the early 1990s and won Linda Buck and Richard Axel a Nobel prize in 2004.

    Each of these receptor types might recognize one or more odorants — and each odorant might be recognized by more than one receptor. Together, the roughly 400 human ORs can respond to a trillion different chemicals. It’s a fiendishly complex, exquisitely tuned, flexible system — and it needs to be, because the chemistry of nature is incredibly diverse, says Aashish Manglik, a biochemist at the University of California, San Francisco. “The breadth of chemicals that make odours is enormous.”

    One important step in cracking the code of smell is to know what the receptors look like and how they recognize chemicals. But they have been notoriously difficult to study. “They’re the most recalcitrant membrane proteins to work with,” says Manglik. Many are too unstable to be expressed in cells in the lab and to generate enough protein to analyse.

    Scientists have deciphered the structure of two ORs from insects5,6. These receptors are of a totally different type to those in mammals, although the olfactory ‘logic’ by which they work together is likely to be similar, says sensory neuroscientist Vanessa Ruta, whose lab at the Rockefeller University in New York solved both the structures.

    Two more receptor structures7,8, from the olfactory system in mice, followed last year. Both of them sense a bunch of chemicals with distinctly unlikeable fishy, musky or putrid odours, many of which are key components of bodily odours in animals.

    Getting at these structures has required some “funky approaches”, says Manglik, because ORs are so hard to grow in the lab. But last year, he was part of a team that succeeded in publishing the first protein structure of a human olfactory receptor bound to an odorant9.

    Having tried almost every OR that they could, Manglik and colleagues found one that is richly expressed outside the nose, in the gut and prostate, and which could, as a result, be made more readily in commonly used cell lines. It’s a receptor called OR51E2 and it responds to the chemical propionate, which has a pungent, cheesy odour.

    Coloured transmission electron micrograph of a section through the olfactory epithelium of the nose.

    The olfactory receptor cell (orange) is a neuron with cilia (red) that reach into the nasal cavity.Credit: Prof. P. Motta/Dept. of Anatomy/University “La Sapienza”, Rome/SPL

    Using cryo-electron microscopy, the team looked at how propionate binds to the receptor in a little pocket, and how that binding changes the receptor’s shape and conveys information onwards. Seeing the structure “was really thrilling”, says Buck, whose lab at the Fred Hutchinson Cancer Center in Seattle, Washington, studies olfactory neuroscience.

    But ORs can detect so many odorants that “the structure of one OR can’t tell us much”, says Hiro Matsunami, an olfaction biologist at Duke University in Durham, North Carolina, who collaborated with Manglik on the study of OR51E2.

    Alongside trying to grow more of them, Matsunami and his colleagues have attempted to understand the OR by re-engineering it. They made some synthetic receptors using OR51E2 and parts from two dozen similar receptors. They aligned the amino-acid sequences of these existing ORs, and chose the most frequent amino acid at each position to build an average, or ‘consensus’, structure. Then they expressed the structure in cells. When they compared their synthetic structure with its real-life counterpart, OR51E2, it looked and behaved just like its sibling10.
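    The column-wise consensus step described above is simple enough to sketch directly: given pre-aligned sequences, take the most frequent residue at each position. The toy fragments below are invented for illustration, not real OR sequences.

```python
from collections import Counter

def consensus_sequence(aligned_seqs):
    """Build a consensus from pre-aligned, equal-length sequences:
    at each column, keep the most frequent residue ('-' gaps ignored)."""
    length = len(aligned_seqs[0])
    consensus = []
    for i in range(length):
        column = [s[i] for s in aligned_seqs if s[i] != "-"]
        consensus.append(Counter(column).most_common(1)[0][0])
    return "".join(consensus)

# Hypothetical aligned fragments (not real receptor sequences)
seqs = ["MKTAY", "MKSAY", "MKTGY"]
print(consensus_sequence(seqs))  # → MKTAY
```

    The appeal of a consensus receptor is that averaging over a family of related sequences tends to produce a more stable protein than any single natural member, which is what made it expressible in cell lines.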

    Next, they tried building another averaged receptor based on an OR with no published structure — OR1A1 — which recognizes a broad range of odorants including some that smell fruity, floral and minty. They used a computational model to explore how it bound to two compounds that both smell of menthol; the compounds bound to the receptor in different places.

    The team thinks that different odorants probably engage a single OR in distinct ways. That would help to explain the level of complexity in the smell code — and could explain why, for example, two disparate chemicals can have similar odours, or why chemically similar compounds can smell so different. The compound carvone, for instance, comes in two varieties that are mirror images of each other; one smells of spearmint, the other of caraway or dill. “There must be a receptor that can explain this,” says Matsunami.

    Some researchers are using machine learning to accelerate the search for structures and their preferred chemical partners. Right now, scientists have identified odour molecules that bind to only about 20% of human ORs.

    The protein-prediction algorithm AlphaFold has suggested thousands of structures for mammalian odorant receptors11. And machine learning and modelling have helped Matsunami and his colleagues to screen millions of compounds to see which ones might bind to two candidate OR structures12. One of the molecules they found smells of orange blossom; another strongly of honey.

    The dream end point is to gather data on hundreds of ORs and how their activation lines up with the chemistry of millions of odorants, says Manglik.

    Lead with the nose

    Once a smell is processed by receptors, this information proceeds to a brain region called the olfactory bulb, which sits behind the bridge of the nose, and onwards to the olfactory cortex. The circuitry for olfaction before information enters the cortex is well understood, particularly in model organisms such as fruit flies and mice. But the olfactory cortex is more of a mystery. “It’s difficult to figure out what’s happening there,” says Buck.

    Many researchers want to understand how information from receptors is organized in the brain, and what rules govern perception. If that were understood, it might be possible to get an animal to perceive a certain smell without even presenting an odorant, simply by recreating the pattern it generates in the brain, says Dima Rinberg, a neuroscientist who studies smell at New York University School of Medicine.

    Another big unknown, Datta says, is how the olfactory system interacts with other crucial brain circuits, such as those that control movement or navigation. Several labs, including his own, are interested in how animals actively sense scent and move towards or away from odours.

    Capturing the connection between scent and behaviour is already possible to some extent in the brains of insects. In fruit flies, for instance, scientists can explore chemical structure, receptors and the brain in a single system. “In insects you can start to span the whole spectrum,” says Ruta.

    Insects’ sense of smell is also relevant to human health. Mosquitoes evolved to sniff out humans, and many insects prey on crops that humans rely on. Last November, Osmo announced a $3.5-million grant from the Bill & Melinda Gates Foundation in Seattle, Washington, aimed at discovering and producing compounds that repel, attract or destroy insects that carry disease.

    Meanwhile, detecting scents is also big business. For some tasks and applications, ‘electronic noses’ are already commercially available: some are designed to detect off odours in food or to pick up smells in waste water. They are being intensively studied as diagnostics for diseases such as TB, diabetes and various cancers.

    But natural sniffers still have an edge, and even without a full understanding of how brains process smell, scientists can exploit biological noses to improve chemical sensing for safety, security or health care.

    The classic example is the sniffer dog, widely used to sense chemicals in explosives or narcotics — but these animals are expensive to train and there are limits to what they can detect.

    Rinberg’s team is aiming to blend animal and digital odour detection. They developed a nose–computer interface in mice13, using electrodes that record signals from the olfactory bulb as mice smell different compounds. The researchers can decode odour identities from neural activity, and then use the patterns to flag these smells under natural conditions. Their device, which is now being developed by a start-up called Canaery, co-founded by Rinberg, retains the precision of the animal’s sense of smell without researchers having to train the animal to respond. “The biological nose is the best chemical detector,” Rinberg says. “The whole machinery is hard to beat.”
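    The decoding idea is, at its core, pattern classification: learn a reference activity pattern per odour from recorded trials, then assign new activity to the closest one. The nearest-centroid sketch below is a minimal illustration under that assumption; the labels, channel counts and firing rates are all made up, and the actual decoder is more sophisticated.

```python
import math

def train_centroids(trials):
    """trials: {odour_label: [firing-rate vectors]} -> per-odour mean vector."""
    centroids = {}
    for label, vecs in trials.items():
        n = len(vecs)
        centroids[label] = [sum(v[i] for v in vecs) / n
                            for i in range(len(vecs[0]))]
    return centroids

def decode(centroids, vec):
    """Assign a new activity pattern to the nearest odour centroid."""
    return min(centroids, key=lambda lab: math.dist(centroids[lab], vec))

# Toy recordings from three channels (hypothetical numbers)
trials = {
    "rose": [[5.0, 1.0, 0.5], [4.5, 1.2, 0.4]],
    "fish": [[0.5, 4.8, 2.0], [0.6, 5.2, 1.8]],
}
centroids = train_centroids(trials)
print(decode(centroids, [4.8, 0.9, 0.6]))  # → rose
```

    Once the reference patterns are learned, no further training of the animal is needed: the interface only has to read out which learned pattern the current activity most resembles.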

    Despite the pre-eminence of biology, many scientists dream of a time when a digital smell sensor will rival those for other senses. “Smartphones can do image and audio recognition,” says Ruta. “But for olfaction there’s nothing like that.”

    And although they know how well biological noses work, researchers still have a lot of outstanding questions. For Buck, the simplest to pose could be the most difficult to answer. “It would be nice to know how you get a perception of a particular odorant,” she says — how the brain, beyond the nose, creates a sense of rose, for instance, and how it distinguishes it from the essence of fish. “How does that happen in the brain? No one knows,” she says. “We don’t have the techniques to figure that out yet.”
