Can DNA Foundation Models Beat Specialized Genomics Tools?

A benchmark of five DNA foundation models on variant classification, gene expression prediction, QTL mapping, and chromatin structure reveals where they match specialized genomics tools and where they fall short.

If you’ve been following AI in computational biology, you’ve heard the hype: foundation models trained on DNA sequences could transform genomics. The pitch is simple: train once on massive datasets, use everywhere. But does the promise hold up in practice?

A new benchmark from researchers at MD Anderson and collaborators set out to answer exactly this question. Rather than relying on vendor claims, they tested five DNA foundation models side-by-side on real genomic tasks: classifying pathogenic variants, predicting gene expression, identifying causal genetic variants, and mapping regulatory regions. The results are nuanced and practical in ways most AI papers are not.

What They Actually Tested

Feng et al. (2025, Nature Communications) benchmarked five models:

  • DNABERT-2: A BERT-style transformer pre-trained on multi-species genomes with byte-pair tokenization
  • Nucleotide Transformer V2: InstaDeep’s large-scale language model for nucleotide sequences
  • HyenaDNA: A long-context architecture built on implicit (Hyena) convolutions rather than attention, designed for long sequences
  • Caduceus-Ph: A bidirectional, reverse-complement-aware variant of the Mamba state-space architecture
  • GROVER: A BERT-style model trained on the human genome with frequency-balanced byte-pair tokenization

They didn’t just evaluate on toy benchmarks. Instead, they tested zero-shot embeddings (no task-specific fine-tuning) on four realistic genomic problems; a minimal extraction sketch follows the list:

  1. Sequence classification: Detecting pathogenic variants in coding regions
  2. Gene expression: Predicting expression levels from DNA sequence
  3. Quantitative trait loci (QTLs): Identifying genetic variants associated with molecular traits
  4. Topologically associating domains (TADs): Mapping 3D chromatin structure regions
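
To ground what “zero-shot embeddings” means operationally, here is a minimal extraction sketch in Python. It follows DNABERT-2’s published HuggingFace usage; the checkpoint name and output indexing are specific to that model and may differ for the other four.

```python
# Minimal sketch of zero-shot token-embedding extraction with a
# HuggingFace DNA language model. Checkpoint name and output indexing
# follow DNABERT-2's published usage; other models differ.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "zhihan1996/DNABERT-2-117M"  # DNABERT-2 checkpoint on the Hub

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).eval()

sequence = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
input_ids = tokenizer(sequence, return_tensors="pt")["input_ids"]

with torch.no_grad():
    # First output element: token-level embeddings,
    # shape (batch, n_tokens, hidden_dim).
    token_embeddings = model(input_ids)[0]
```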

These four tasks mirror what working genomics labs actually need a model to do.

The Key Finding: Mean Pooling Wins

The most striking result was unexpected: how you aggregate token embeddings matters far more than which model you choose. When the researchers compared pooling strategies, mean pooling over all token embeddings consistently and significantly outperformed max pooling and other aggregation methods across tasks.

This is important because it’s practically actionable. If you’re planning to use a DNA foundation model, you should default to mean pooling unless you have domain-specific reasons not to.

Why does this matter? Foundation models convert a DNA sequence into a high-dimensional embedding (a numerical representation) for every token; depending on the model, a token is a k-mer, a byte-pair-encoded chunk, or a single nucleotide. To make a sequence-level prediction, you must aggregate those token-level representations into a single vector. Max pooling keeps only the most extreme activation in each dimension, discarding nuance; mean pooling weights all positions equally, preserving information across the entire sequence. The paper shows this simple choice has a larger practical impact than swapping models entirely.
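
As a concrete illustration, here is a minimal sketch of masked mean pooling versus max pooling over token embeddings of shape (batch, tokens, hidden_dim). The masking detail matters: without it, padded positions distort both aggregates.

```python
# Masked mean pooling vs. max pooling over token embeddings.
import torch

def mean_pool(token_embeddings: torch.Tensor,
              attention_mask: torch.Tensor) -> torch.Tensor:
    # Zero out padding positions, then average over real tokens only.
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

def max_pool(token_embeddings: torch.Tensor,
             attention_mask: torch.Tensor) -> torch.Tensor:
    # Push padding positions to -inf so they never win the max.
    mask = attention_mask.unsqueeze(-1).bool()
    return token_embeddings.masked_fill(~mask, float("-inf")).max(dim=1).values
```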

Where Foundation Models Shine and Fail

The benchmark revealed clear task-dependent strengths and weaknesses:

Strong performers: For pathogenic variant detection, the general-purpose DNA foundation models were competitive with specialized prediction tools. DNABERT-2 and Nucleotide Transformer V2 performed well on sequence classification tasks, suggesting that general genomic pre-training does capture information about variant pathogenicity. This is practically significant because variant pathogenicity prediction remains a routine task in clinical genomics and research labs analyzing whole genome sequencing data.
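
One common way to exploit zero-shot embeddings for a task like this is a lightweight probe: freeze the model, pool the token embeddings, and fit a simple classifier on top. The sketch below uses scikit-learn with random stand-ins for the pooled embeddings and pathogenicity labels, so it illustrates the workflow rather than the paper’s exact protocol.

```python
# Sketch of a lightweight probe on frozen, mean-pooled embeddings.
# X and y are random stand-ins for real embeddings and labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))   # stand-in: mean-pooled embeddings
y = rng.integers(0, 2, size=200)  # stand-in: pathogenic (1) / benign (0)

clf = LogisticRegression(max_iter=1000)
auc = cross_val_score(clf, X, y, scoring="roc_auc", cv=5).mean()
print(f"mean cross-validated AUROC: {auc:.3f}")  # ~0.5 on random data
```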

Significant gaps: Foundation models struggled with gene expression prediction and QTL identification. This matters because expression prediction is a core genomics task and QTL mapping is central to understanding disease genetics. When predicting expression levels from DNA sequence alone, specialized methods still won decisively, and the foundation models similarly lagged at identifying genetic variants associated with molecular traits (QTLs).

The researchers were direct about why: general-purpose models are not optimized for regulatory and expression dynamics. They learn sequence composition and conservation signals, but not tissue-specific regulatory codes. A foundation model sees millions of sequences during pre-training but has no direct signal about which ones produce high expression in liver versus brain. A specialized expression prediction model trained on tissue-specific data captures those patterns explicitly.

Limitations and What They Mean

This was a zero-shot benchmark, meaning the models were tested without any task-specific training. In practice, researchers can fine-tune foundation models on small amounts of labeled data, which would likely improve performance on expression and QTL tasks. The benchmark doesn’t tell you what you can achieve with fine-tuning. If you have 1,000 cell lines with both expression data and genomic sequences, fine-tuning a foundation model might close the gap with specialist methods. The paper tested only the out-of-the-box behavior.
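
For readers weighing that option, a hypothetical fine-tuning sketch with the HuggingFace Trainer follows. The checkpoint name is a placeholder and the two-example dataset is a toy stand-in; the paper itself did not fine-tune, so treat this purely as a starting template.

```python
# Hypothetical fine-tuning sketch with the HuggingFace Trainer.
# MODEL_ID is a placeholder; sequences and labels are toy stand-ins.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_ID = "your-org/dna-foundation-model"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=2, trust_remote_code=True)

sequences = ["ACGTACGTGGCATTAGC", "TTGACCATGCAACGGTA"]  # toy sequences
labels = [1, 0]                                         # toy labels

enc = tokenizer(sequences, padding=True, return_tensors="pt")
train_ds = [
    {"input_ids": enc["input_ids"][i],
     "attention_mask": enc["attention_mask"][i],
     "labels": torch.tensor(labels[i])}
    for i in range(len(labels))
]

training_args = TrainingArguments(
    output_dir="dna-finetune",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_ds)
trainer.train()
```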

Additionally, the benchmark was limited to human genomic tasks. Results may differ substantially for other organisms, ancient DNA, or non-standard genomic data. If you work with bacterial genomes, plant genomics, or other model organisms, you cannot directly apply these conclusions without testing on your own organism.

The study also didn’t evaluate computational cost or inference speed. Nucleotide Transformer V2 is substantially larger than HyenaDNA; if you’re processing whole genomes or large variant sets, model size matters practically even if accuracy is equivalent. Inference speed and memory requirements are often deal-breakers in production pipelines, yet this benchmark excluded them. A slower, more accurate model may not be usable if you need to score millions of variants in a clinical setting.
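
If you need that number for your own setup, a crude latency probe is easy to write. The sketch below assumes the model and input_ids from the extraction example above and omits GPU-specific details (on CUDA, add torch.cuda.synchronize() around the timed region).

```python
# Rough latency check before committing to a model in a pipeline.
import time
import torch

@torch.no_grad()
def mean_latency_seconds(model, input_ids, n_iters=20):
    model(input_ids)  # warm-up run (lazy initialization, kernel caching)
    start = time.perf_counter()
    for _ in range(n_iters):
        model(input_ids)
    return (time.perf_counter() - start) / n_iters
```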

The Bottom Line

DNA foundation models are worth evaluating for sequence classification and variant annotation tasks, where they match or exceed specialized tools out of the box. For gene expression and regulatory prediction, specialized models remain superior unless you have the data and compute to fine-tune a foundation model effectively.

The practical implication: don’t assume “foundation model” means “universally better.” The task determines whether a general approach beats a specialist one. For novel genomic prediction problems without an established specialist tool, a foundation model may be your best starting point. For well-defined problems like variant pathogenicity scoring, they’re now competitive alternatives to SIFT and similar tools.

Source and Further Reading

Feng, H., Wu, L., Zhao, B., et al. Benchmarking DNA foundation models for genomic and genetic tasks. Nature Communications 16, 10780 (2025).

Full text available via PubMed Central.