Chemists ran 50,688 reactions to make a huge open dataset

June 18, 2026

Artificial intelligence tools like AlphaFold have transformed the landscape of protein design in less than a decade. Meanwhile, organic chemists are still waiting for their own “AlphaFold moment” for predicting reaction recipes and routes for small-molecule synthesis.

Part of the issue for reaction prediction models, says medicinal chemist Tim Cernak, is that the available training data don’t paint a comprehensive picture of all of the variables that could affect a reaction.

Most studies in the literature look at how many different starting materials react under one set of optimized reaction conditions. That's great for predicting how well a particular reaction protocol will work on a given molecule. But it's less useful for forecasting the outcome of changes to a reaction protocol: for instance, what modifications might be necessary to get a coupling reaction to work with a first-row transition-metal catalyst instead of palladium. Given that the world's palladium supply is largely controlled by Russia, being able to easily identify alternatives in case of trade disruptions is important to the pharmaceutical supply chain.

Cernak and his team instead focused on systematically sampling a large number of reaction conditions for a limited number of starting materials, running 33,792 variants on one pair of starting materials alone.

The reaction recipes in the literature for different catalyst metals don’t overlap much, so it’s difficult to translate between them, says Cernak. To fill in some of those gaps, Cernak and his team at the University of Michigan used ultra-high-throughput automation to create a dataset of 50,688 recipes for C–N coupling, one of the most common metal-catalyzed reactions used in pharmaceutical synthesis (J. Am. Chem. Soc. 2026, DOI: 10.1021/jacs.6c05959).

The dataset, which the researchers have made available on the Open Reaction Database, is nearly five times as large as the next largest C–N coupling dataset. Cernak says that his team explicitly designed it to maximize direct comparisons between palladium, nickel, and copper catalysts.

“Initially, we just took a 1,536-well plate, loaded it up with palladium catalyst, and then just copied the plate and printed again with copper, and then copied another one and ran it again with nickel,” Cernak says.

Over the course of about a year, the researchers screened tens of thousands of combinations of 33 metal catalysts, 166 ligands, 17 bases, 4 solvents, and 3 reaction temperatures on just two pairs of starting materials. They analyzed the products and yield of each combination using ultraperformance liquid chromatography/mass spectrometry.
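For a sense of scale, the counts quoted above can be put through some back-of-the-envelope arithmetic. The Python sketch below is illustrative only and rests on two assumptions that the article does not confirm: that a full factorial design would cross every catalyst, ligand, base, solvent, temperature, and starting-material pair with every other, and that each well of a 1,536-well plate holds exactly one reaction in the dataset. The variable names are made up for this example.

```python
from math import prod

# Counts quoted in the article.
factors = {
    "metal catalysts": 33,
    "ligands": 166,
    "bases": 17,
    "solvents": 4,
    "temperatures": 3,
    "starting-material pairs": 2,
}
reactions_run = 50_688
wells_per_plate = 1_536

# Assumption: a full factorial design crosses every option with every other.
full_factorial = prod(factors.values())
fraction_sampled = reactions_run / full_factorial

# Assumption: every well holds exactly one reaction from the dataset.
plates, leftover = divmod(reactions_run, wells_per_plate)

print(f"full factorial: {full_factorial:,} combinations")
print(f"fraction sampled: {fraction_sampled:.2%}")
print(f"plate equivalents: {plates} (remainder {leftover})")
```

Under those assumptions, the full grid would hold more than 2.2 million combinations, so the 50,688 reactions cover roughly 2% of it, and the dataset corresponds to exactly 33 full plates' worth of wells.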

“There were tons of new engineering innovations we had to put in place” to be able to run such a high volume of coupling reactions, Cernak says. For example, high-throughput systems struggle with dispensing solids, so the researchers delivered the bases in aqueous solution and then removed the water under vacuum before adding the organic components. They heated the reactions using thermocyclers, which are typically used for polymerase chain reactions.

The researchers used their collected data to identify phosphine and N-heterocyclic carbene ligands that tend to work well with multiple metals. They also found conditions where C–N coupling happens even without a metal present if the base is strong enough, hinting at a base-catalyzed mechanism.

“From my point of view, it’s super useful,” says Philippe Schwaller, who researches machine learning for chemistry at the Swiss Federal Institute of Technology, Lausanne (EPFL). “Typical reaction datasets are kind of wide and shallow,” whereas this one is narrow and deep, making it a unique and valuable resource for insight on reaction prediction and optimization.

Fifty thousand reactions seems like a lot, Cernak says, but it’s “still really nowhere near enough data to predict anything new.” AlphaFold was made possible only through decades of the protein community standardizing their data and benchmarking their models. Predictive models for catalysis will also take a big collective effort.

“We’re excited to see what algorithms come from this data, and what things we missed,” Cernak says. He and his team have several more data drops planned, focusing on other commonly used catalytic reactions. “We really hope that the data talks to the literature so that we can have our AlphaFold moment.”