Introduction
If you’re new to bulk RNA-seq, the tooling landscape looks overwhelming. Do you use STAR or bowtie2? RSEM, featureCounts, or Salmon for quantification? DESeq2 or edgeR? Should you build your own pipeline or use nf-core?
The truth: for a one-time analysis or to understand what’s actually happening in each step, building a pipeline manually is worth the effort. You’ll know exactly where your data is, what parameters matter, and how to troubleshoot when something goes wrong. You’ll also gain the confidence to modify the pipeline for your specific experimental design.
This guide walks you through building a complete, working bulk RNA-seq pipeline using three industry-standard tools. STAR for alignment, RSEM for transcript quantification, and DESeq2 for differential expression analysis. Every command here is tested and real. By the end, you’ll have a reproducible pipeline that handles quality control, alignment, quantification, and statistical testing in one workflow.
If you want a pre-built, production-ready pipeline instead, nf-core/rnaseq exists for exactly that. But if you want to understand what’s happening at each step, read on.
Prerequisites
Before starting, you’ll need:
- Linux or macOS (Windows users can use WSL 2)
- ~100 GB free disk space (STAR index alone is 30 GB for human)
- A conda/mamba environment with Python 3.10+ and R 4.2+ installed (see our guide to conda setup)
- At least 16 GB RAM for alignment (32 GB is better)
- FASTQ files from your RNA-seq experiment (paired-end or single-end)
- A reference genome FASTA file and GTF annotation
For this tutorial, we’ll use human data (GRCh38) but the steps work for any organism with a reference genome.
Understanding the Pipeline Architecture
Before diving into commands, here’s what each step does:
- FastQC: Quality control on raw reads. Identifies adapter contamination, low-quality bases, and other issues.
- STAR alignment: Maps reads to the reference genome. STAR is fast, accurate, and handles splicing well.
- RSEM quantification: Estimates transcript and gene expression levels from aligned reads. Handles multi-mapping reads intelligently.
- DESeq2 analysis: Performs differential expression analysis and generates statistics on which genes change between conditions.
Together, these produce a count matrix you can use for downstream analysis.
Step 1: Install Required Tools
Create a dedicated conda environment for your RNA-seq work:
mamba create -n rnaseq -c bioconda -c conda-forge \
star=2.7.11 \
rsem=1.3.3 \
fastqc=0.12.1 \
samtools=1.19 \
python=3.11 \
r-base=4.3 \
bioconductor-deseq2 \
-y
Activate the environment:
mamba activate rnaseq
Verify installations:
STAR --version
rsem-calculate-expression --version
fastqc --version
samtools --version
Rscript --version
All commands should return version numbers without errors.
Step 2: Download and Index the Reference Genome
STAR requires a pre-built index of your reference genome. For human (GRCh38):
mkdir -p ~/rna-seq-data/references/GRCh38
cd ~/rna-seq-data/references/GRCh38
# Download FASTA and GTF from Ensembl
wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
wget https://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz
# Decompress
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.110.gtf.gz
# Build STAR index (this takes 30-60 minutes and 30 GB)
STAR --runMode genomeGenerate \
--genomeDir STAR_index \
--genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--sjdbGTFfile Homo_sapiens.GRCh38.110.gtf \
--sjdbOverhang 100 \
--runThreadN 8
The sjdbOverhang parameter should be set to (read_length - 1). If your reads are 100 bp, use 99. If 150 bp, use 149.
Once complete, you should see:
160000000 sequences processed in 10.00 minutes
Generating suffix arrays with 16 threads
Log.out indicates a successful build and writes "... Done!"
Step 3: Quality Control with FastQC
Run FastQC on a subset of your FASTQ files to check for quality issues before committing to the full analysis:
mkdir -p ~/rna-seq-data/results/fastqc
# Run FastQC on all FASTQ files
fastqc ~/rna-seq-data/raw/*.fastq.gz -o ~/rna-seq-data/results/fastqc -t 8
Open the HTML reports in a browser. Look for:
- Per base quality: Should show green (>Q30) for most positions. A decline toward the 3’ end is normal.
- Adapter content: If adapter libraries are present, you’ll need to trim them before alignment.
- GC content: Should show a single peak unless you have significant contamination.
If quality looks poor at the 3’ end, run Trim Galore or Cutadapt to remove low-quality bases:
mamba install trim-galore -y
# Trim bases with Q < 20 from the 3' end
trim_galore --quality 20 --paired \
~/rna-seq-data/raw/sample1_R1.fastq.gz \
~/rna-seq-data/raw/sample1_R2.fastq.gz \
-o ~/rna-seq-data/trimmed/
For this tutorial, we’ll assume your reads are high quality and skip trimming.
Step 4: Align Reads with STAR
STAR is fast and accurate for RNA-seq alignment. For each sample:
mkdir -p ~/rna-seq-data/results/star_bam
STAR --genomeDir ~/rna-seq-data/references/GRCh38/STAR_index \
--readFilesIn ~/rna-seq-data/raw/sample1_R1.fastq.gz ~/rna-seq-data/raw/sample1_R2.fastq.gz \
--readFilesCommand zcat \
--runThreadN 8 \
--outFileNamePrefix ~/rna-seq-data/results/star_bam/sample1_ \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped None \
--outFilterScoreMinOverLread 0.3 \
--outFilterMatchNminOverLread 0.3 \
--quantMode GeneCounts
Key parameters:
readFilesCommand zcat: Decompresses gzipped FASTQ files on the fly.outSAMtype BAM SortedByCoordinate: Outputs a sorted BAM file (required for RSEM).quantMode GeneCounts: Produces a gene count file that RSEM will use.outFilterScoreMinOverLreadandoutFilterMatchNminOverLread: Filter low-quality alignments (standard settings).
After alignment, you’ll see:
~/rna-seq-data/results/star_bam/
├── sample1_Aligned.sortedByCoord.out.bam
├── sample1_Aligned.sortedByCoord.out.bam.bai
├── sample1_Log.final.out
└── sample1_ReadsPerGene.out.tab
Check the Log.final.out file to see alignment statistics:
grep "Uniquely mapped reads" ~/rna-seq-data/results/star_bam/sample1_Log.final.out
For typical RNA-seq, you should see 70-90% uniquely mapped reads.
Step 5: Prepare Reference for RSEM
RSEM needs a reference in its own format. Convert the GTF and FASTA:
mkdir -p ~/rna-seq-data/references/GRCh38/RSEM_ref
rsem-prepare-reference \
--gtf ~/rna-seq-data/references/GRCh38/Homo_sapiens.GRCh38.110.gtf \
~/rna-seq-data/references/GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
~/rna-seq-data/references/GRCh38/RSEM_ref/GRCh38
This creates several index files in the RSEM_ref directory.
Step 6: Quantify with RSEM
Run RSEM on each sample’s BAM file:
mkdir -p ~/rna-seq-data/results/rsem
rsem-calculate-expression \
--bam \
--estimate-rspd \
--paired-end \
--no-bam-output \
--seed 12345 \
-p 8 \
~/rna-seq-data/results/star_bam/sample1_Aligned.sortedByCoord.out.bam \
~/rna-seq-data/references/GRCh38/RSEM_ref/GRCh38 \
~/rna-seq-data/results/rsem/sample1
Key parameters:
--bam: Input is BAM format (from STAR).--estimate-rspd: Learns read start position distribution from the data (improves accuracy).--paired-end: Your data is paired-end (remove if single-end).--seed 12345: Ensures reproducibility.-p 8: Use 8 threads.
Output files:
~/rna-seq-data/results/rsem/
├── sample1.genes.results # Gene expression estimates
├── sample1.isoforms.results # Transcript-level estimates
└── sample1.stat/ # Diagnostic plots and statistics
The .genes.results file contains columns:
gene_id: Gene identifier from GTF.transcript_id(s): Transcripts in this gene.length: Effective gene length.effective_length: Adjusted for fragment length.expected_count: Estimated number of reads mapping to this gene.TPM: Transcripts per million (normalized expression).FPKM: Fragments per kilobase of exon per million mapped reads (older normalization).
Step 7: Merge Results and Prepare for DESeq2
Create a count matrix combining all samples:
mkdir -p ~/rna-seq-data/results/deseq2
# Extract gene IDs and expected counts from all RSEM outputs
head -1 ~/rna-seq-data/results/rsem/sample1.genes.results | cut -f1,5 > ~/rna-seq-data/results/deseq2/gene_counts_raw.txt
for sample in sample1 sample2 sample3; do
tail -n +2 ~/rna-seq-data/results/rsem/${sample}.genes.results | cut -f1,5 | \
awk -v s="$sample" '{print $1 "\t" $2}' >> ~/rna-seq-data/results/deseq2/gene_counts_raw.txt
done
Better approach: use RSEM’s built-in merge function:
rsem-generate-data-matrix \
~/rna-seq-data/results/rsem/sample1.genes.results \
~/rna-seq-data/results/rsem/sample2.genes.results \
~/rna-seq-data/results/rsem/sample3.genes.results \
> ~/rna-seq-data/results/deseq2/gene_counts_matrix.txt
This produces a matrix with gene IDs as rows and samples as columns, with expected counts as values.
Step 8: Differential Expression Analysis with DESeq2
Create an R script to run DESeq2 on your count matrix:
cat > ~/rna-seq-data/results/deseq2/deseq2_analysis.R << 'EOF'
library(DESeq2)
library(ggplot2)
# Load count matrix
counts <- read.table("gene_counts_matrix.txt", header=TRUE, row.names=1)
counts <- as.matrix(counts)
# Define sample metadata
# Modify this to match your experimental design
sample_metadata <- data.frame(
sample = c("sample1", "sample2", "sample3", "sample4", "sample5", "sample6"),
condition = c("control", "control", "control", "treatment", "treatment", "treatment"),
row.names = c("sample1", "sample2", "sample3", "sample4", "sample5", "sample6")
)
# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = sample_metadata,
design = ~ condition
)
# Pre-filter: remove genes with very low counts
dds <- dds[rowSums(counts(dds)) >= 10, ]
# Run DESeq2
dds <- DESeq(dds)
# Extract results
results <- results(dds, contrast = c("condition", "treatment", "control"))
results <- results[order(results$pvalue), ]
# Write results to file
write.csv(as.data.frame(results), "deseq2_results.csv", row.names=TRUE)
# Create MA plot
pdf("ma_plot.pdf")
plotMA(results, ylim=c(-5, 5))
dev.off()
# Create PCA plot
rld <- rlog(dds, blind=FALSE)
pdf("pca_plot.pdf")
plotPCA(rld, intgroup="condition")
dev.off()
# Print summary
summary(results)
EOF
Run the analysis:
cd ~/rna-seq-data/results/deseq2
Rscript deseq2_analysis.R
Output files:
deseq2_results.csv: Full results with log2 fold changes, p-values, and adjusted p-values.ma_plot.pdf: MA plot showing fold change vs. mean expression.pca_plot.pdf: PCA plot showing sample clustering by condition.
Step 9: Interpret Results
The key columns in deseq2_results.csv are:
baseMean: Average expression across all samples.log2FoldChange: Log2 ratio of treatment/control expression.pvalue: Raw p-value from statistical test.padj: Adjusted p-value (accounts for multiple testing).
Filter for significant genes (typically padj < 0.05):
awk 'NR==1 || $NF < 0.05' deseq2_results.csv | head -20
This shows the top 20 significant genes. Positive log2FoldChange means upregulated in treatment. Negative means downregulated.
Step 10: Troubleshooting and Best Practices
High multi-mapping rate in STAR alignment?
If alignment shows <60% uniquely mapped reads, your reads may be too short or your genome index was built incorrectly. Check that sjdbOverhang matches your read length minus 1.
Very few genes significant in DESeq2?
Check that your sample metadata (control vs. treatment) is correct. Look at the PCA plot. If treatment and control samples don’t cluster separately, the experimental manipulation may not have worked.
RSEM crashes on memory error?
Reduce threads (-p 4 instead of -p 8) or allocate more RAM.
Next Steps and Alternatives
For a production pipeline, consider switching to nf-core/rnaseq, which handles QC, alignment, quantification, and quality control in one reproducible workflow. It also supports multiple quantification methods (STAR + RSEM, STAR + Salmon, HISAT2, etc.) and runs on HPC clusters.
For your first analysis, though, building this manually teaches you where every file comes from and how to troubleshoot when something breaks. Once you’re confident, automate with Nextflow or Snakemake.
Read our guide on our Nextflow beginner guide to learn how to wrap this pipeline in a reproducible workflow manager.
For deeper statistical understanding, compare this to our DESeq2 vs. edgeR comparison post to see when to use alternatives.
If you want a book that ties together Unix, Python, R, and the data skills behind everything in this pipeline, Bioinformatics Data Skills by Vince Buffalo (O’Reilly) is the most practical reference available — it covers reproducible pipelines, file formats, and data wrangling in the context of real bioinformatics work.