The AI Chemist: To be trustworthy, LLMs need to show their work

June 16, 2026

Introducing the AI Chemist

Artificial intelligence and large language models offer promising methods to interpret vast amounts of data but also more than a few cautions. This C&EN column will cover what the technologies can do now, what they could do in the future, and what they shouldn’t tackle—all written by expert contributors.

Drug discovery is really, really hard, and most drug candidates fail: humans are variable, the animals we use for testing aren’t humans, pharmacokinetics and pharmacodynamics are hard to predict, and unsuspected off-target effects cause toxicity. The emergence of artificial intelligence and machine learning tools such as AlphaFold has raised excitement around the potential to accelerate early-stage drug discovery. Even AI skeptics, who professionally criticize the utility and ethics of ChatGPT and other large language models (LLMs), will often say, “But of course, AlphaFold is helping cure cancer, so it’s not all bad.”

I’m not sure I agree that software like AlphaFold is an exception.

Computer-aided drug design (CADD) uses models of proteins and chemical compounds to prioritize compounds for investigation as possible drugs. This approach accelerates the early stage of drug development, as it helps focus attention on the likeliest candidates while considering frankly enormous numbers of candidates—as many as trillions of compounds. To be fair, medicinal chemists have always considered protein structure when designing drugs, but the tools and the availability of protein structures were more limited in the past. Over the past 50 years, slowly (and then ever more quickly), the computational tools addressing docking, molecular dynamics, free-energy perturbation calculations, and single-point quantum mechanics calculations have improved. The computing power available to run these calculations has improved, and so has the availability of experimentally confirmed protein conformations. But both the use of these tools and the acquisition of the data they rely on require a lot of expertise.

When AlphaFold arrived, suddenly all structures of all proteins were available to everyone at the click of a button. Then LLM-guided docking arrived, which simplified protein-ligand CADD docking, greatly democratizing the screening of protein-ligand interactions. While I was writing this column, a high school student approached me and my colleagues, asking us to help them test some proposed drug candidates they had identified through “vibe CADDing,” that is, performing CADD by conversing with an AI assistant. You can now do all the steps of CADD without the collaboration of a bunch of PhDs from different disciplines and without having studied any organic or physical, let alone quantum or medicinal, chemistry at all.

The problem is that in practice, conducting these studies still does require extensive collaboration. A structure from the Protein Data Bank (PDB) is a single frozen conformation of a highly dynamic protein. You can’t use that structure directly in a CADD study without considering dynamics, the protein’s inherent natural environment, the influence of the ligands that the protein was soaked with as a way to reduce motion enough to make a crystal or get a good cryo-electron microscopy ensemble, and more.

Structural biologists know these considerations: many proteins have dozens (or even hundreds) of different entries in the PDB. Those entries exist because the structures are not all the same. Computational, all-atom simulations can be used to model some of this dynamism back in, but the assumptions used in these calculations are abstractions of physical reality. And when we assume . . . that leads to weird errors for you and me. But if you do this process regularly, you know what to look for and how to validate your models.
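To make "not all the same" concrete, one common way to put a number on the difference between two conformations is the root-mean-square deviation (RMSD) of matched atom positions. The sketch below is illustrative only: the coordinates are made-up toy values, the names `rmsd`, `entry_1`, and `entry_2` are hypothetical, and it assumes the two conformations are already superimposed and list the same atoms in the same order, which real PDB entries generally do not guarantee.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z) points.

    Assumes both conformations are already superimposed and list the same
    atoms in the same order; no alignment is performed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("conformations must have the same non-zero number of atoms")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy, made-up coordinates (in angstroms) for three atoms in two hypothetical entries.
entry_1 = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
entry_2 = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 3.0, 0.0)]

print(round(rmsd(entry_1, entry_2), 3))
```

A single number like this only says that two entries differ, not why; deciding which conformation is relevant to a given docking study is still the judgment call described above.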

If a protein is not present in the PDB, you can use homology modeling to try to recreate a reasonable 3D structural ensemble. Because that model is built on shaky ground, validating it against your own integrated experimental and computational data can take highly skilled practitioners months or years.

In contrast, AlphaFold gives me a structure in seconds! Using it is so much more efficient. But my team often sees AlphaFold make the very human mistake of imposing structural order where there is only chaos. Proteins with long unstructured domains that flop around a lot can’t be captured using our current techniques. So these parts of (or whole) proteins aren’t in the PDB datasets used to train AlphaFold. A model of roads trained on only grid cities won’t properly predict the street plans of Rome or Boston. Having the available data biased toward a well-organized structure is a problem. But it isn’t the big one.

The main problem is that I don’t know how the program got the structure. I can’t track the process back to check the assumptions, the steps, or the models that went into the structure generation. I can’t check if it is right. Getting a good protein model (note, not a structure; almost all useful CADD models are a collection of individual conformations inclusive of a dynamic modeling component) is the most important step in CADD.

Good science requires a “chain of custody of why” connecting input data and observation to an output conclusion. Every step in the logic chain must be auditable by our peers and those who want to use our work. We must be able to challenge and potentially falsify every decision point so that when something inevitably goes wrong, we can try to figure out where we went wrong. Protein-structure models don’t allow for this.
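One way to picture such a chain of custody is as a log in which every modeling decision records what was done, why, and what it was built on, with each entry linked to the one before it so that a silent change is detectable. The following is a minimal sketch under stated assumptions, not a description of any existing CADD tool or of my group's workflow: the `Step` fields, the function names, and the example entries are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class Step:
    action: str      # what was done
    rationale: str   # why it was done: the assumption a peer can challenge
    inputs: tuple    # identifiers of the data or earlier results used
    prev_hash: str   # digest of the previous step, linking the chain

    def digest(self):
        # Hash a canonical JSON rendering of this step's fields.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def append_step(chain, action, rationale, inputs):
    """Return a new chain with one more step linked to the current last step."""
    prev = chain[-1].digest() if chain else ""
    return chain + [Step(action, rationale, tuple(inputs), prev)]


def verify(chain):
    """Check that every step still points at the unmodified step before it."""
    expected = ""
    for step in chain:
        if step.prev_hash != expected:
            return False
        expected = step.digest()
    return True


# Hypothetical example entries; the identifiers are placeholders, not real data.
chain = []
chain = append_step(chain, "select starting conformation",
                    "ligand-bound state assumed closest to the target state",
                    ["example-entry-A"])
chain = append_step(chain, "run all-atom simulation",
                    "sample motion missing from the single frozen structure",
                    ["example-entry-A", "example-force-field"])

print(verify(chain))

# Quietly rewriting an earlier rationale breaks the link to the next step.
tampered = [replace(chain[0], rationale="changed after the fact")] + chain[1:]
print(verify(tampered))
```

The hashing is incidental; the point is that each decision carries its own stated "why" and its inputs, so a reviewer can walk backward from a conclusion and challenge any individual link.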

Consequently, I don’t use AlphaFold—or any similar LLM—to generate structures. I made that decision because the only way to validate what the programs give me is to use the “old fashioned” (that is, circa 2020 but with our better computing tools in 2026) protein-preparation approach to check. The result is interesting if the program’s product is more or less accurate than mine. It is interesting if it is completely wrong, and the program is leading others down the garden path of futility.

I’m certainly not saying these models are always wrong. That suggestion is far from the case. The problem is they aren’t always right. And if you are relying on them for CADD (rather than as pretty pictures), they need to be. The old way will be wrong sometimes. But you will know why you are wrong when the data start coming back, and you will be able to adjust because you have your chain of custody of why. With an LLM, there is only the one step. If you can’t examine the “why?” at every step of the chain, you can’t determine if your hypothesis is realistic and worth spending large amounts of money to check out, and you can’t tell if you would be better served by hosting a giant bonfire party using your investors’ cash as fuel. And that situation makes me nervous. Nervous enough not to use an LLM.

Credit: Courtesy of John Trant

John Trant is an associate professor and a Faculty of Science research chair at the University of Windsor, where he leads a team of interdisciplinary scientists tackling societally relevant molecular problems.

Views expressed are those of the author and not necessarily those of C&EN or the American Chemical Society.

Do you have a view about using AI in chemistry that you’d like to share with our readers? Please email editor Chris Gorski at [email protected] to propose a column that you’d like to write.
