The Finding in Plain Terms
Researchers at Synthesize Bio and collaborating institutions built GEM-1 (Generate Expression Model-1), a generative AI system trained to predict the outcomes of gene expression experiments before they’re run in the lab. The model was trained on hundreds of thousands of real RNA sequencing datasets, learning the patterns of how genes behave under different biological conditions.
The key result: when they asked GEM-1 to predict gene expression changes from experiments performed after its training data cutoff, the model’s predictions matched actual lab results with accuracy comparable to what you’d expect when repeating the same experiment twice in a wet lab. This is not incremental improvement. This is a computational system that performs like matched biological replicates.
Why This Matters
Gene expression prediction is one of the oldest unsolved problems in computational biology. For decades we’ve been able to sequence genomes, but predicting which genes will be turned on or off in response to a specific perturbation, disease state, or drug exposure has remained largely guesswork, despite massive investments in machine learning.
The practical bottleneck this solves is real: gene expression experiments are expensive, time-consuming, and yield data that is noisy and variable even when executed perfectly. A researcher studying cancer drug resistance, for example, might need to run RNA-seq on dozens of patient samples or cell line variants to understand which therapeutic targets are viable. If an AI model could predict those outcomes before the experiments are run, it would collapse both the cost and the timeline of research.
GEM-1’s accuracy suggests a deeper insight: there is enough signal in publicly available gene expression data that a large generative model can learn the rules governing how cells respond to perturbations. This is analogous to how large language models learn language structure from raw text, except here the “language” is the logic of gene regulation.
How They Built It
The team trained GEM-1 on approximately 500,000 bulk RNA-seq and single-cell RNA-seq experiments from public repositories, including data from GEO (Gene Expression Omnibus), ArrayExpress, and cell atlas projects. The model is structured as a conditional diffusion model, which iteratively generates predicted expression values conditioned on experimental metadata (cell type, perturbation, timepoint, organism, etc.).
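The preprint’s architectural details aren’t reproduced here, but the general shape of conditional diffusion sampling is worth seeing. In the sketch below, the `Denoiser` network, its layer sizes, and the linear noise schedule are all illustrative assumptions, not GEM-1’s actual design:

```python
import torch

# Illustrative denoiser: predicts the noise added to an expression profile,
# given the noisy profile, the diffusion timestep, and an embedding of the
# experimental metadata (cell type, perturbation, timepoint, ...).
class Denoiser(torch.nn.Module):
    def __init__(self, n_genes: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(n_genes + cond_dim + 1, hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden, n_genes),
        )

    def forward(self, x_t, t, cond):
        t_feat = t.float().view(-1, 1) / 1000.0  # normalized timestep
        return self.net(torch.cat([x_t, t_feat, cond], dim=-1))

@torch.no_grad()
def sample(model, cond, n_genes, steps=1000):
    """DDPM-style ancestral sampling conditioned on experiment metadata."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(cond.shape[0], n_genes)      # start from pure noise
    for t in reversed(range(steps)):
        t_batch = torch.full((cond.shape[0],), t)
        eps = model(x, t_batch, cond)            # predicted noise at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:                                # add noise except at the last step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x  # predicted (normalized) expression profile
```

In a setup like this, training would fit the denoiser to undo noise added to real expression profiles; at sampling time, the metadata embedding steers generation toward the requested experiment, which is what lets one model answer many different “what if” queries.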
Crucially, they validated the model on data deposited after the training cutoff, which means the experiments the model predicted were genuinely held out: biologically real experiments the AI had never seen. They compared GEM-1’s predictions to actual RNA-seq results using standard metrics (correlation, rank correlation, and mean absolute error) and estimated a statistical upper bound on achievable accuracy by comparing technical replicates of the same experiments in the training set.
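Those metrics are standard and easy to reproduce. A minimal sketch of the replicate-ceiling comparison, using synthetic vectors in place of real replicate data:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(pred, obs):
    """The three metrics named above: correlation, rank correlation, MAE."""
    return {
        "pearson_r": pearsonr(pred, obs)[0],
        "spearman_rho": spearmanr(pred, obs)[0],
        "mae": float(np.mean(np.abs(pred - obs))),
    }

# Synthetic stand-ins for real data: two technical replicates of the same
# experiment, plus a model prediction. The replicate-vs-replicate score is
# the empirical ceiling no model can be expected to beat.
rng = np.random.default_rng(0)
truth = rng.normal(size=2_000)                     # latent "true" signal
rep_a = truth + rng.normal(scale=0.3, size=2_000)  # replicate noise
rep_b = truth + rng.normal(scale=0.3, size=2_000)
model_pred = truth + rng.normal(scale=0.35, size=2_000)

ceiling = evaluate(rep_a, rep_b)    # replicate-vs-replicate bound
score = evaluate(model_pred, rep_b) # model-vs-observed
print(f"replicate ceiling r={ceiling['pearson_r']:.3f}, "
      f"model r={score['pearson_r']:.3f}")
```

If a model’s score approaches the replicate-vs-replicate score, it is performing at the level the assay’s own noise allows, which is the paper’s headline claim.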
The results showed GEM-1’s predictions fell within the range of technical variation observed in real replicate experiments. For genes that were robustly regulated (high signal), the model achieved particularly strong predictive power. This is important because it means the model isn’t overfitting to low-signal noise; it’s learning genuine biological relationships.
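One way to probe the same question in your own evaluations is to score high-signal and low-signal genes separately; the threshold and variable names below are illustrative:

```python
import numpy as np
from scipy.stats import pearsonr

def stratified_correlation(pred_lfc, obs_lfc, threshold=1.0):
    """Score high-signal and low-signal genes separately.

    pred_lfc/obs_lfc: predicted and observed log2 fold-changes per gene.
    Genes with |observed LFC| >= threshold form the robustly regulated set.
    """
    high = np.abs(obs_lfc) >= threshold
    r_high = pearsonr(pred_lfc[high], obs_lfc[high])[0]
    r_low = pearsonr(pred_lfc[~high], obs_lfc[~high])[0]
    return r_high, r_low
```

Strong correlation within the robustly regulated set is the stratum that actually tests whether the model has learned real regulation, since low-signal genes are dominated by measurement noise.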
Limitations and Caveats
This is a preprint, not yet peer-reviewed. That’s the first and most important caveat. Preprints can contain errors, and independent validation by other labs is still needed before claims about real-world utility should be acted on.
Second, the model’s accuracy varies by context. It performs best on standard model organisms and well-characterized cell types where lots of public data exists. For rarer cell types or novel perturbations that are underrepresented in public databases, accuracy will be lower, especially when extrapolating beyond the range of training data.
Third, the model predicts relative expression changes, not absolute abundance. It tells you which genes go up or down, not the actual concentration or counts of RNA molecules. This is sufficient for many downstream applications, but not all.
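In practice, “relative expression change” usually means a log fold-change. A minimal illustration, assuming normalized per-gene expression values (the pseudocount is a common convention, not a detail from the preprint):

```python
import numpy as np

def log2_fold_change(treated, control, pseudocount=1.0):
    """Per-gene relative change: which genes go up or down, and by how much.

    treated/control: normalized expression per gene. The result says a gene
    doubled (LFC = +1) or halved (LFC = -1); it does not recover the absolute
    number of RNA molecules in either condition.
    """
    return np.log2((treated + pseudocount) / (control + pseudocount))
```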
Fourth, GEM-1 was trained on static snapshot experiments. It does not yet model dynamic time-series responses or predict how cells behave in complex, multi-step experimental protocols. Predicting the outcome of a 30-day differentiation protocol is harder than predicting a single timepoint.
Finally, the model is a powerful tool for hypothesis generation and prioritization of experiments, not a replacement for wet lab work. A predicted result still needs validation in the actual biological system of interest.
What This Means in Practice
For computational biologists and bioinformaticians, this work has several implications:
Accelerated hypothesis generation. Instead of running exploratory RNA-seq experiments to narrow down which perturbations or drug candidates to investigate further, you could use GEM-1 (or similar models as they become available) to simulate dozens of scenarios and prioritize only the most promising ones for wet lab validation; a sketch of such a screening loop follows this list. This could cut experiment timelines from months to weeks.
Better experimental design. Before running a resource-intensive study, you could use the model to predict which conditions are most likely to yield clear, interpretable results. This is especially valuable for clinical researchers working with limited patient samples.
Integration into existing pipelines. As similar generative models mature and are published, they’ll likely be incorporated into standard bioinformatics workflows via containers and package managers. The path exists for GEM-1 or its successors to become as routine as DESeq2 for differential expression analysis.
New benchmarking standards. This work raises an important question: what should accuracy look like for a computational model of biology? The paper’s comparison to technical replicates sets a clear, empirically grounded target. Future models will be compared against this standard.
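To make the screening idea from the first implication concrete, here is a hypothetical prioritization loop. `predict_expression` is a stand-in for a GEM-1-style model call (no public API is described in the preprint), and the signature indices and candidate names are invented for illustration:

```python
import numpy as np

N_GENES = 20_000
SIGNATURE = np.arange(50)  # hypothetical indices of a resistance signature

def predict_expression(perturbation: str, cell_type: str, seed: int) -> np.ndarray:
    """Hypothetical stand-in for a generative model call.

    Returns predicted per-gene log2 fold-changes; here it just draws random
    values so the example runs without a real model.
    """
    rng = np.random.default_rng(seed)
    return rng.normal(scale=0.5, size=N_GENES)

candidates = ["drugA", "drugB", "drugC", "KO_TP53"]
scores = {}
for i, perturbation in enumerate(candidates):
    lfc = predict_expression(perturbation, "A549", seed=i)
    # Rank by predicted impact on the signature genes of interest.
    scores[perturbation] = float(np.mean(np.abs(lfc[SIGNATURE])))

# Send only the top-ranked perturbations to the wet lab for validation.
shortlist = sorted(scores, key=scores.get, reverse=True)[:2]
print("prioritized for wet lab:", shortlist)
```

Only the shortlist proceeds to actual experiments, which is where the cost and timeline savings come from.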
Bottom Line
Koytiger et al. (2025, bioRxiv) demonstrate that generative AI models trained on public genomics data can predict gene expression outcomes at near-replicate accuracy. This is genuine progress on a hard problem. The work is early (a preprint that still needs independent validation), and deployment will require careful consideration of when predictions are trustworthy and when they require wet lab confirmation. But the results suggest that computational modeling of cellular behavior is moving from “interesting prototype” to “practical tool.”
The implications ripple outward: faster discovery cycles, lower research costs, and new ways to prioritize experiments before resources are committed. For a field accustomed to treating computational predictions as rough guides, this level of accuracy represents a meaningful inflection point.