Introduction
You’ve found a published RNA-seq study that you want to reproduce, or you need a reference dataset for a new analysis. You know the data is out there in a public repository, but navigating NCBI GEO and the SRA (Sequence Read Archive) feels like deciphering a map written in a language you’ve never seen before. Accession numbers, run IDs, sample metadata scattered across four different interfaces, and downloads that stall after hours of waiting. The data is free and public, but accessing it shouldn’t feel this broken.
By the end of this tutorial, you will have downloaded a complete RNA-seq dataset, retrieved its metadata, verified the files, and be ready to begin your pipeline. You’ll know exactly which tools to use, what to do when downloads fail, and how to script the whole process so you never have to hunt for a button again.
Prerequisites
Before you start, make sure you have the following on your system:
- A conda or mamba environment (not sure? Read “How to Set Up a Bioinformatics Environment with Conda and Mamba” first).
- At least 50 GB of free disk space. Many RNA-seq datasets are 10-50 GB raw; keep extra buffer.
- SRA Toolkit installed via conda (we’ll cover this).
- wget or curl installed (most Linux/macOS systems have these by default).
- A terminal / command line you’re comfortable working in.
No programming experience is required. All the commands are standard bash and will be shown in full.
Finding Data on GEO
GEO (Gene Expression Omnibus) is the easiest entry point. It’s a curated database where researchers deposit their processed and raw data. Every dataset gets a GSE (GEO Series) number.
Searching for Your Dataset
Go to https://www.ncbi.nlm.nih.gov/geo/ and search for your study. You can search by:
- The first author’s name
- A keyword from the title (e.g., “acute leukemia single-cell”)
- A disease or cell type (e.g., “breast cancer”)
- The GEO accession directly (if you already have it, like GSE12345)
Let’s work through a real example: GSE241155, a 2024 study on induced pluripotent stem cells (iPSCs) differentiating into neurons. This is a good teaching dataset because it’s recent, well-annotated, and reasonably sized.
Navigate to: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE241155
Reading a GEO Series Page
On the GSE page, you’ll see:
- Series title and description at the top
- Overall design — what the study compared
- Sample table — a list of all the GSM (GEO Sample) numbers with descriptions
- Supplementary file section — processed files (often DESeq2 tables, count matrices)
- SRA information — link to the raw sequencing runs
The sample table is where you find the GSM numbers. Each GSM is one sample (e.g., “iPSC_day0_rep1”, “neuron_day30_rep2”).
From GSM to SRR: Finding the Run Accessions
In the sample table, each GSM row has an “SRA” link. Click it. This takes you to the SRA Run Selector, which lists all the sequencing runs (SRR numbers) for that sample.
An SRR (SRA Run) is the actual sequencing file. One sample (GSM) can contain one or more runs if it was sequenced across multiple lanes or flow cells.
On the SRA Run Selector, you’ll see a table with:
- Run — the SRR number (e.g., SRR12345678)
- Bases — total number of sequenced bases (tells you coverage)
- Spots — total number of reads
- Library layout — SINGLE or PAIRED
- Platform — usually Illumina
Tip: You can select multiple runs and download them in bulk. At the bottom of the Run Selector table, there’s an “Accession List” button that exports all selected SRR numbers as plain text. Download this; you’ll use it later.
For GSE241155, there are 24 samples. The SRR numbers range from SRR25401000 to SRR25401023. Save that list.
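Because the runs in this series are numbered consecutively, you can also generate the accession list in the shell instead of clicking through the Run Selector. This sketch assumes the consecutive numbering above; the exported “Accession List” file remains the authoritative source:

```shell
# Generate accessions.txt for the 24 consecutive runs SRR25401000-SRR25401023
for i in $(seq 25401000 25401023); do
  echo "SRR${i}"
done > accessions.txt
wc -l < accessions.txt
# 24
```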
Downloading with SRA Toolkit
The SRA Toolkit is the official tool for downloading from the Sequence Read Archive. It’s maintained by NCBI and handles the complexity of different file formats and compression schemes.
Installing SRA Toolkit via Conda
Create a new conda environment or add to an existing bioinformatics one:
conda install -c bioconda sra-tools
Verify installation:
fastq-dump --version
You should see version output, typically 3.x (older 2.11.x builds also work).
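A quick loop confirms that all three binaries you’ll use are on your PATH (it prints MISSING for anything the conda install didn’t provide):

```shell
# Check that the SRA Toolkit binaries are reachable
for tool in prefetch fasterq-dump vdb-config; do
  command -v "$tool" >/dev/null && echo "$tool: found" || echo "$tool: MISSING"
done
```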
Configuring SRA Toolkit (Critical Step)
Before you download anything, configure the toolkit to use a cache directory and set permissions correctly. This is where most new users hit trouble.
vdb-config --interactive
This opens a configuration menu. Navigate to:
- Open the “Cache” tab
- Set the repository root to a path with plenty of space (e.g., /home/you/sra_cache/)
- Enable remote access if using a networked drive
- Save and exit
Alternatively, set the cache location non-interactively:
vdb-config --set "/repository/user/main/public/root=/home/you/sra_cache"
Downloading with prefetch
prefetch downloads the .sra file to your cache. It’s slower than direct FASTQ conversion but handles errors better.
prefetch SRR25401000
For multiple runs, create a text file with one SRR per line (call it accessions.txt):
SRR25401000
SRR25401001
SRR25401002
SRR25401003
Then download in parallel using GNU parallel (if installed):
cat accessions.txt | parallel prefetch
Without parallel, use a simple loop (slower, but reliable):
while read accession; do
prefetch "$accession"
done < accessions.txt
Expected output after a successful prefetch:
2025-10-14T12:34:56 prefetch.2.11.1: 1) Downloading 'SRR25401000'...
2025-10-14T12:35:02 prefetch.2.11.1: 1) Downloaded 5,234,567,890 bytes in 6 seconds
2025-10-14T12:35:03 prefetch.2.11.1: Caching file within /.../cache
The .sra files are now cached and ready for conversion.
Converting to FASTQ with fasterq-dump
Now convert those .sra files to FASTQ format, which is what your downstream tools expect:
fasterq-dump --split-files SRR25401000 -O /path/to/output/
The --split-files flag is crucial for paired-end reads. It creates two files:
- SRR25401000_1.fastq (read 1)
- SRR25401000_2.fastq (read 2)
For single-end, you get one file: SRR25401000.fastq
For multiple runs:
while read accession; do
fasterq-dump --split-files "$accession" -O ./fastq_files/
done < accessions.txt
This will take a while. A single RNA-seq run can easily reach tens of gigabytes uncompressed. Progress output looks like:
spots read : 123,456,789
reads read : 246,913,578 (paired: 123,456,789)
reads written : 246,913,578
Compressing FASTQs
Your FASTQ files are uncompressed and huge. Compress them:
gzip ./fastq_files/*.fastq
This reduces file size by roughly 70-80% without losing information. Your pipeline tools will read .fastq.gz directly.
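gzip is single-threaded, so compressing many large files one at a time is slow. xargs -P runs several gzip processes at once. The demo below creates two tiny stand-in FASTQ files so the snippet is runnable as-is; point the glob at your real files:

```shell
mkdir -p ./fastq_files
# Tiny stand-in FASTQ files so the example runs anywhere
printf '@r1\nACGT\n+\nIIII\n' > ./fastq_files/demo_1.fastq
printf '@r1\nACGT\n+\nIIII\n' > ./fastq_files/demo_2.fastq
# Compress up to 4 files in parallel
ls ./fastq_files/*.fastq | xargs -P 4 -n 1 gzip
ls ./fastq_files/
```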
Downloading via ENA (Faster Alternative)
The European Nucleotide Archive is a mirror of SRA with faster FTP servers, especially for users in Europe or with unreliable connections to NCBI servers.
Getting FTP Links from ENA
For each SRR accession, construct the ENA FTP URL:
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{first 6 characters}/{subdir}/{accession}/{accession}_1.fastq.gz
The {subdir} depends on the accession’s length: for the 11-character accessions used here, it is “0” followed by the last two digits. For example, SRR25401000 becomes:
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR254/000/SRR25401000/SRR25401000_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR254/000/SRR25401000/SRR25401000_2.fastq.gz
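The middle directory follows an ENA convention based on accession length: no subdirectory for 9 characters, “00” plus the last digit for 10, “0” plus the last two digits for 11. A small helper makes the rule explicit:

```shell
# Build the ENA FASTQ directory URL for an SRA run accession
ena_prefix() {
  local acc="$1" sub
  case "${#acc}" in
    9)  sub="" ;;               # SRR123456   -> no subdirectory
    10) sub="00${acc:9:1}/" ;;  # SRR1234567  -> 007/
    11) sub="0${acc:9:2}/" ;;   # SRR25401000 -> 000/
    *)  sub="${acc:9:3}/" ;;    # 12+ characters: last three digits
  esac
  echo "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${acc:0:6}/${sub}${acc}"
}
ena_prefix SRR25401000
# ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR254/000/SRR25401000
```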
Batch Download with wget
Create a script to download all:
#!/bin/bash
mkdir -p ./fastq_files
while read -r accession; do
  # For 11-character accessions the subdirectory is "0" plus the last two digits
  base_url="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${accession:0:6}/0${accession:9:2}/${accession}/"
  wget -c -P ./fastq_files/ "${base_url}${accession}_1.fastq.gz"
  wget -c -P ./fastq_files/ "${base_url}${accession}_2.fastq.gz"
done < accessions.txt
This is often faster than SRA Toolkit, especially on international connections, and with the -c flag wget resumes interrupted downloads if you run the command again.
Downloading Metadata Programmatically
You have your FASTQ files, but you also need the metadata: sample names, treatment groups, replicate numbers, and other experimental design information. This is usually in the GEO series page, but extracting it programmatically is more reliable.
Using pysradb
pysradb is a Python tool that queries SRA directly and returns structured metadata.
Install:
pip install pysradb
Or via conda:
conda install -c bioconda pysradb
Download metadata for an SRA project (identified by its SRP number, found on the GEO series page):
mkdir -p ./metadata
pysradb metadata SRP325829 --detailed --saveto ./metadata/SRP325829.tsv
This writes a tab-separated table with one row per run, including the run accession, sample title, library layout, bases, spots, and other experiment details.
Parse the table in your favorite language. For example, in R:
metadata <- read.delim("./metadata/SRP325829.tsv")
head(metadata)
Or in Python:
import pandas as pd
metadata = pd.read_csv("./metadata/SRP325829.tsv", sep="\t")
print(metadata[['run_accession', 'sample_title', 'library_layout']])
This metadata ties each SRR to its sample and experimental group, which you’ll need when setting up your DESeq2 design matrix later.
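For a quick look without leaving the shell, cut works directly on a tab-separated metadata table. The inline demo row below stands in for real pysradb output; check your file’s header for the actual column order:

```shell
# Demo metadata table (tab-separated), standing in for the real download
printf 'run_accession\tsample_title\tlibrary_layout\nSRR25401000\tiPSC_day0_rep1\tPAIRED\n' > demo_metadata.tsv
# Pull the accession and sample title columns
cut -f1,2 demo_metadata.tsv
```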
Common Errors and Fixes
Error 1: “Cannot open cache directory”
Cause: SRA Toolkit can’t write to the configured cache location.
Fix: Check the cache path exists and is writable:
ls -la /path/to/sra_cache/
chmod 755 /path/to/sra_cache/
Or reconfigure to use a temp directory:
vdb-config --interactive
Error 2: “Connection timeout” during prefetch
Cause: Network issues or NCBI server overload.
Fix: Try downloading from ENA instead (see section above). If using prefetch, resume the download:
prefetch SRR25401000 # Will resume from where it left off
Error 3: “No space left on device”
Cause: Uncompressed FASTQ files are larger than anticipated.
Fix: Stop immediately. Compress existing FASTQs:
gzip ./fastq_files/*.fastq
Then check disk:
df -h
du -sh ./fastq_files/
If you’re still out of space, consider keeping only compressed .fastq.gz and deleting the intermediate .sra cache files:
rm -r ~/.ncbi/public/sra/
Error 4: “Corrupt download / CRC error”
Cause: Interrupted download. The .fastq.gz file is incomplete.
Fix: Delete and re-download. Most tools have resume capabilities:
rm ./fastq_files/SRR25401000_1.fastq.gz
wget -c ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR254/000/SRR25401000/SRR25401000_1.fastq.gz
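You can confirm a suspect file really is corrupt before re-downloading: gzip -t checks archive integrity without extracting anything. Demonstrated here on a tiny file created on the spot:

```shell
# Make a tiny valid gzip file, then test it
printf '@r1\nACGT\n+\nIIII\n' | gzip > demo.fastq.gz
gzip -t demo.fastq.gz && echo "intact"
# Truncate a copy to simulate an interrupted download, then test again
head -c 10 demo.fastq.gz > broken.fastq.gz
gzip -t broken.fastq.gz 2>/dev/null || echo "corrupt"
```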
Error 5: “fasterq-dump: Command not found”
Cause: SRA Toolkit not in PATH or conda environment not activated.
Fix:
conda activate your_bioinformatics_env
which fasterq-dump # Should show the path
If still not found, reinstall:
conda install -c bioconda sra-tools --force-reinstall
FASTQ File Naming Gotcha
Some downstream tools (like STAR) expect paired-end FASTQ files to follow a specific naming pattern. SRA Toolkit produces:
SRR25401000_1.fastq.gz
SRR25401000_2.fastq.gz
But some tools expect:
sample_R1.fastq.gz
sample_R2.fastq.gz
The pattern doesn’t matter as long as your pipeline configuration knows which files are R1 and R2. But if tools complain that files don’t exist or aren’t recognized, check the naming first. Rename if needed:
mv SRR25401000_1.fastq.gz sample_R1.fastq.gz
mv SRR25401000_2.fastq.gz sample_R2.fastq.gz
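With the metadata from earlier you can script the renaming instead of doing it by hand. This is a sketch: rename_map.tsv is a hypothetical two-column file (accession, then sample name) that you would build from your metadata; the demo lines create one so the loop runs as-is:

```shell
# Demo mapping and files so the loop is runnable as-is
printf 'SRR25401000\tiPSC_day0_rep1\n' > rename_map.tsv
touch SRR25401000_1.fastq.gz SRR25401000_2.fastq.gz
# Rename each pair according to the map
while IFS=$'\t' read -r acc name; do
  mv "${acc}_1.fastq.gz" "${name}_R1.fastq.gz"
  mv "${acc}_2.fastq.gz" "${name}_R2.fastq.gz"
done < rename_map.tsv
ls iPSC_day0_rep1_R*.fastq.gz
```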
Verifying Your Downloads
Once all files are downloaded, check them:
# Count total reads in a file
zcat ./fastq_files/SRR25401000_1.fastq.gz | wc -l
# Divide by 4 because FASTQ has 4 lines per read
Expected output: a line count that, divided by 4, matches the “Spots” field in the SRA Run Selector. For example, 123,456,788 lines / 4 = 30,864,197 reads.
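Wrapping the count in a small function makes it easy to run over every file. The demo uses a one-read FASTQ created inline so the snippet is self-contained; gzip -dc is used instead of zcat for portability:

```shell
# Reads in a gzipped FASTQ = line count / 4
count_reads() {
  echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}
# One-read demo file so the function can be exercised anywhere
printf '@read1\nACGT\n+\nIIII\n' | gzip > demo.fastq.gz
count_reads demo.fastq.gz
# 1
```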
Spot-check file integrity:
zcat ./fastq_files/SRR25401000_1.fastq.gz | head -4
Should show:
@SRR25401000.1 xxx xxx length=xx
ACGTACGTACGTACGT...
+
IIIIIIIIIIIIIIII...
If the header doesn’t start with “@” or the quality line with “+”, the file is corrupted. Re-download.
Next Steps
You now have a complete set of raw RNA-seq FASTQ files and their metadata. The next step is alignment and quantification. If you haven’t already, read “How to Build a Bulk RNA-seq Pipeline: STAR, RSEM, DESeq2” to learn how to take these FASTQs through quality control, alignment, and gene expression quantification.
Before you start that pipeline, ensure your computing environment is set up correctly. If you haven’t used conda or mamba yet, “How to Set Up a Bioinformatics Environment with Conda and Mamba” covers everything you need.
The commands and workflow you’ve learned here work for any SRA dataset. You now have the skills to download data from any published study, any disease context, any sequencing platform. The public sequencing archives contain petabytes of data. You have the keys.