Nicheformer: A Foundation Model for Single-Cell and Spatial Omics

A transformer foundation model trained on more than 110 million cells learns from spatial transcriptomics to recover tissue context for dissociated single-cell data.

Single-cell RNA sequencing has revolutionized how we study individual cells, but it comes with a fundamental trade-off: once cells are dissociated from tissue, their spatial organization is lost. Researchers know that cell identity and function depend on neighborhood context, yet most computational tools treat each cell as isolated. Nicheformer: a foundation model for single-cell and spatial omics, published in Nature Methods in December 2025, addresses this problem directly. The model learns to predict where in a tissue a cell likely resided based purely on its transcriptome, effectively recovering spatial structure that dissociation erased.

The Problem: Spatial Context Lost in Dissociation

Single-cell RNA-seq (scRNA-seq) has generated massive catalogs of cell states. Yet dissociation destroys tissue architecture, creating a data bottleneck: researchers have millions of dissociated cells sequenced but far fewer spatially resolved samples from the same tissues. This asymmetry limits our ability to understand how transcriptional programs depend on tissue context.

Spatial transcriptomics methods like 10x Visium and MERFISH capture spatial information directly, but they are expensive, labor-intensive, and cover fewer cells than dissociation-based approaches. The practical bottleneck is this: we have abundant scRNA-seq data from dissociated cells, but spatially resolved profiles from comparatively few, carefully selected tissue sections. Can computational models learn to restore spatial information from dissociated data?

The Approach: Pretraining on SpatialCorpus

The Nicheformer model is built on a straightforward but ambitious idea: train a transformer jointly on dissociated scRNA-seq and spatial transcriptomics data, then use that learned representation to infer spatial structure for purely dissociated cells.

The researchers assembled SpatialCorpus-110M, a curated collection of over 110 million cells:

  • 57 million dissociated single cells from public datasets
  • 53 million spatially resolved cells from targeted spatial transcriptomics studies
  • Coverage spanning 73 human and mouse tissues

The model uses a transformer encoder over tokenized gene expression profiles to learn a latent representation of each cell's transcriptional state. Pretraining is self-supervised across both the dissociated and spatially resolved portions of the corpus, spanning diverse tissues and cell types; because the spatially resolved cells carry niche and region annotations, downstream heads can then map the learned representation onto specific spatial contexts.
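One common way such models turn an expression vector into a transformer input is rank-value encoding: sort genes by descending expression and keep the top-ranked gene ids as the token sequence. A minimal sketch of that idea; the vocabulary, function name, and cutoff here are illustrative, not Nicheformer's actual implementation:

```python
# Hypothetical gene vocabulary: gene name -> integer token id.
GENE_VOCAB = {name: idx for idx, name in enumerate(["GeneA", "GeneB", "GeneC", "GeneD"])}

def rank_tokenize(expression: dict, max_len: int = 3) -> list:
    """Turn an expression profile into a rank-ordered token sequence.

    Genes are sorted by descending expression; the top `max_len` expressed
    genes become the cell's tokens. Zero-expression genes are dropped.
    """
    expressed = [(g, v) for g, v in expression.items() if v > 0 and g in GENE_VOCAB]
    expressed.sort(key=lambda gv: -gv[1])
    return [GENE_VOCAB[g] for g, _ in expressed[:max_len]]

cell = {"GeneA": 0.0, "GeneB": 5.2, "GeneC": 1.1, "GeneD": 9.7}
print(rank_tokenize(cell))  # GeneD, GeneB, GeneC by descending expression -> [3, 1, 2]
```

The appeal of rank encoding is robustness: it depends only on the ordering of genes within a cell, not on absolute counts, which vary across platforms.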

What Nicheformer Can Do

Once trained, Nicheformer enables three key capabilities:

1. Spatial label prediction: Given a dissociated cell’s transcriptome, predict which tissue region or cell neighborhood it came from. The model can output a spatial map showing where in tissue space an arbitrary set of scRNA-seq cells would be expected to reside.

2. Spatial composition prediction: For a given tissue location, predict the composition and relative abundance of cell types that should be present, effectively reconstructing local cellular neighborhoods.

3. Transfer of spatial context to scRNA-seq: By fine-tuning on paired dissociated and spatial data from a new tissue, Nicheformer can learn tissue-specific spatial organization and apply it to larger dissociated cohorts, multiplying the value of expensive spatial experiments.
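For intuition on capability 2, neighborhood-composition targets can be constructed from spatial data by counting cell types among each cell's nearest neighbors. A small numpy sketch with made-up coordinates and labels (the choice of k and the toy data are purely illustrative):

```python
import numpy as np

def neighborhood_composition(coords, cell_types, n_types, k=3):
    """For each cell, the cell-type proportions among its k nearest neighbors.

    coords:     (n_cells, 2) spatial coordinates
    cell_types: (n_cells,) integer type labels in [0, n_types)
    Returns:    (n_cells, n_types) array whose rows sum to 1.
    """
    coords = np.asarray(coords, dtype=float)
    cell_types = np.asarray(cell_types)
    # Pairwise Euclidean distances; exclude each cell from its own neighborhood.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]  # indices of the k nearest cells
    comp = np.zeros((len(coords), n_types))
    for i, nbrs in enumerate(neighbors):
        counts = np.bincount(cell_types[nbrs], minlength=n_types)
        comp[i] = counts / k
    return comp

# Toy example: four cells on a line, two cell types.
coords = [[0, 0], [1, 0], [2, 0], [10, 0]]
types = [0, 0, 1, 1]
print(neighborhood_composition(coords, types, n_types=2, k=2))
```

A model predicting these proportion vectors from a cell's transcriptome alone is, in effect, reconstructing the local cellular neighborhood without ever seeing coordinates at inference time.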

The model outperforms baselines trained only on dissociated data. As the authors note, models lacking spatial training fundamentally fail to capture the complexity of tissue microenvironments. The point is simple but important: dissociated-only models cannot learn spatial structure they have never seen.

Implications for Computational Biology

For researchers, Nicheformer offers practical advantages. If you have large scRNA-seq cohorts but limited spatial data, you can now impute spatial context computationally. This is particularly valuable for disease studies where spatial heterogeneity is functionally relevant (e.g., tumor microenvironment composition, immune cell positioning in inflamed tissues).

The work also represents a broader trend: foundation models trained on massive, curated datasets are becoming standard tools for genomics. Like pretrained language models in NLP, these models encode general principles of transcriptional organization that transfer across tissues and species. Fine-tuning on a specific tissue or condition is often more efficient than training from scratch.

Nicheformer is among the first foundation models to integrate dissociated single-cell and spatial data at this scale. The pretrained model is publicly available, enabling researchers to apply it without retraining.

Limitations and Caveats

Nicheformer’s predictions are aggregate patterns, not ground-truth spatial coordinates. Predicting whether a cell belongs to the core or edge of a tissue region is different from predicting its exact location. The model captures broad spatial structure (e.g., epithelial layers, immune infiltration zones) better than fine-grained positioning.

The training set is biased toward well-studied tissues. Rare tissues or cell types underrepresented in spatial transcriptomics may receive less reliable imputations. Spatial organization also varies across disease states and conditions; a model trained on healthy tissue may not transfer reliably to diseased tissue.

The model assumes that transcriptional signatures of spatial context are consistent across donors and conditions. This is reasonable for stable anatomical structure but may not hold in tissues undergoing active remodeling (e.g., development, wound healing, acute inflammation).

Finally, Nicheformer is trained on current spatial transcriptomics methods (Visium, MERFISH, etc.). As new spatial technologies emerge with different resolution and coverage, retraining may be necessary to capture their specific spatial signatures.

Technical Considerations for Users

Nicheformer is a general model; it works best when your tissue is represented in the training data. If you are working with a rare tissue or an unusual condition, linear probing (freezing the pretrained weights and training only a small classifier layer) may outperform fine-tuning.
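Linear probing itself is simple to sketch: freeze the pretrained encoder, treat its embeddings as fixed features, and train only a softmax layer on top. A self-contained numpy illustration using random stand-in "embeddings" (no actual Nicheformer weights are involved; the hyperparameters are arbitrary):

```python
import numpy as np

def train_linear_probe(emb, labels, n_classes, lr=0.5, steps=200):
    """Fit a softmax classifier on frozen embeddings (the probe's only weights).

    emb:    (n_cells, d) embeddings from the frozen pretrained model
    labels: (n_cells,) integer region/niche labels
    Returns (W, b) of the trained linear layer.
    """
    n, d = emb.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = emb @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                  # cross-entropy gradient
        W -= lr * emb.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# Toy "frozen embeddings": two well-separated clusters standing in for regions.
rng = np.random.default_rng(1)
emb = np.vstack([rng.normal(-2, 0.5, (20, 4)), rng.normal(2, 0.5, (20, 4))])
labels = np.array([0] * 20 + [1] * 20)
W, b = train_linear_probe(emb, labels, n_classes=2)
pred = (emb @ W + b).argmax(axis=1)
print((pred == labels).mean())  # expect high training accuracy
```

Because only `W` and `b` are updated, the probe cannot overwrite the pretrained representation, which is precisely why it can be the safer choice for tissues far from the training distribution.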

The model processes cells independently, which means it does not account for direct cell-cell interactions or signaling at inference time. It predicts position based on intrinsic transcriptional state, not neighbor communication. This is a deliberate design choice, but it is worth keeping in mind when interpreting results.

Integration with other spatial methods is straightforward. You can combine Nicheformer predictions with experimental spatial data to boost coverage or validate computational imputations.
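One lightweight way to validate imputations where measured spatial data exist is to correlate predicted and observed neighborhood compositions per location. A small sketch; the function name and toy proportion vectors are invented for illustration:

```python
import numpy as np

def composition_agreement(pred, observed):
    """Pearson correlation between predicted and measured cell-type
    proportions, computed per spatial location (per row)."""
    pred = np.asarray(pred, dtype=float)
    observed = np.asarray(observed, dtype=float)
    pc = pred - pred.mean(axis=1, keepdims=True)
    oc = observed - observed.mean(axis=1, keepdims=True)
    denom = np.sqrt((pc**2).sum(axis=1) * (oc**2).sum(axis=1))
    return (pc * oc).sum(axis=1) / denom

pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # imputed proportions per location
obs  = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]  # measured proportions per location
print(composition_agreement(pred, obs))
```

Locations with low agreement flag regions where the imputation should not be trusted, which is often more actionable than a single global accuracy number.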

Why This Matters

The field of single-cell biology has produced mountains of transcriptome data from dissociated cells. That data is valuable but incomplete without spatial context. Nicheformer begins to close that gap, offering a practical way to recover tissue organization from dissociated sequencing.

For postdocs and PhD students, this is a tool to add to your toolkit: if you have scRNA-seq data and need to understand spatial organization, try Nicheformer before investing in expensive spatial experiments. It will not replace spatial transcriptomics, but it is a useful starting point for hypothesis generation.

The broader lesson is that pretrained, publicly available foundation models are becoming standard infrastructure in computational biology, much like reference genomes. Nicheformer is one of several emerging models in genomics that encode generalizable principles across tissues and organisms.


Source and Further Reading

Lotfollahi, M., Wolf, F. A., et al. Nicheformer: a foundation model for single-cell and spatial omics. Nature Methods 22, 2525–2538 (2025). DOI: 10.1038/s41592-025-02814-z