RNA-seq Read Trimming: fastp vs. Trim Galore

Introduction

You’ve just downloaded raw FASTQ files from the Sequence Read Archive (SRA) or they’ve rolled off your sequencer, and now you face a decision: should you trim these reads, and if so, which tool do you use?

This is where most RNA-seq pipelines start to diverge. Some labs trim aggressively. Others skip trimming entirely, betting that modern aligners will handle adapter contamination. Both approaches can be justified, but the decision should be informed by your actual data quality, not just pipeline folklore.

By the end of this post, you will have a working understanding of when trimming matters, how to assess your reads with FastQC, and hands-on experience with the two dominant trimming tools: fastp and Trim Galore. You will also know which tool fits your workflow and why.

Prerequisites

Before starting, you will need:

A conda or mamba environment set up. If you do not have this, follow the guide How to Set Up a Bioinformatics Environment with Conda and Mamba to get started.
Raw FASTQ files from RNA-seq. If you need to download these from the public databases, start with How to Download RNA-seq Data from GEO and SRA.
A basic understanding of paired-end vs. single-end reads (no deep knowledge required, but helpful context).

Do You Actually Need to Trim?

This question does not get asked often enough, and the honest answer is: maybe not.

Modern RNA-seq aligners like STAR and HISAT2 are robust. They can soft-clip read ends that do not match the reference, which masks a moderate amount of adapter sequence and low-quality bases at read ends. For well-prepared RNA-seq libraries from a good core facility, trimming may add nothing except compute time.

However, trimming becomes important when:

Your data has high adapter contamination. If FastQC shows “Adapter Content” flagged as a problem (see below), adapters are present in a significant fraction of reads.
Quality scores drop sharply at the 3’ end. If the last 5-10 bases of your reads have Phred scores below 20, trimming helps. This is especially relevant for 3’ counting methods such as QuantSeq, where the 3’ end is the signal. (CAGE and nanoCAGE, by contrast, capture the 5’ end of transcripts.)
Insert size is very short. When fragments are shorter than the read length, the sequencer reads through into the adapter, and the resulting reads may not align well even with soft-clipping. After adapter removal, many of these reads end up shorter than a typical minimum length (for example 50 bases) and are discarded entirely, which is usually fine.
You’re using a pipeline that expects it. Some workflows (particularly those designed for Illumina TruSeq data) assume trimmed input.
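
A quick way to put a number on the short-read question is to count how many reads in a file fall under a length cutoff. The following is a minimal sketch, not part of any tool: the function name short_read_fraction and the file name in the usage line are made up for illustration, and it assumes standard four-line FASTQ records.

```shell
# Report how many reads in a gzipped FASTQ are shorter than a cutoff.
# Assumption: standard 4-line FASTQ records (sequence lines are not wrapped).
# Output: <short reads> <total reads> <percent short>
short_read_fraction() {
  local fastq_gz="$1"
  local min_len="${2:-50}"   # cutoff in bases; 50 is only an illustrative default
  gzip -dc "$fastq_gz" | awk -v min="$min_len" '
    NR % 4 == 2 { total++; if (length($0) < min) n_short++ }
    END {
      if (total == 0) { print "0 0 0.00"; exit }
      printf "%d %d %.2f\n", n_short, total, 100 * n_short / total
    }'
}

# Usage (hypothetical file name):
# short_read_fraction sample_R1.fastq.gz 50
```

This is most informative on data that has already been trimmed once (as some SRA submissions are), since untrimmed Illumina reads usually all have the same length.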

For most bulk RNA-seq projects from modern instruments with standard protocols, trimming is optional. That said, running FastQC takes five minutes and provides clarity. Do it first.

Running FastQC First: Assess Before You Trim

FastQC gives you the baseline. Run it on your raw FASTQ files before any trimming.

# Install FastQC if you don't have it
conda install -c conda-forge -c bioconda fastqc

# Run FastQC on a single FASTQ file
fastqc sample_R1.fastq.gz

# For paired-end, run on both reads
fastqc sample_R1.fastq.gz sample_R2.fastq.gz

# Run on every gzipped FASTQ file in the current directory
fastqc *.fastq.gz

This creates .html and .zip output files for each input file. Open the HTML file in a browser.

What to look for:

Per base sequence quality: A graph showing Phred score (y-axis, ideally >30) across read position. Green is good, yellow is acceptable, red is poor. If scores drop below 20 over the last 10% of read positions, trimming the 3’ end is reasonable.
Adapter content: Shows the cumulative percentage of reads containing adapter sequences at each position. Ideally this is <1%. If it is >10%, you have a trimming problem.
Per sequence GC content: Should show a smooth single hump (if bimodal, you may have contamination, but this is rare in RNA-seq).
Overrepresented sequences: Flags sequences that appear in >0.1% of reads. These are often adapters or rRNA.

If all metrics are green except perhaps a minor flag or two, trimming is probably not needed. If you see red flags in quality or adapter content, proceed to trimming.
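
With many samples, opening every HTML report gets tedious. Each FastQC .zip contains a summary.txt with one PASS/WARN/FAIL line per module, so the flagged modules can be listed from the command line. This is a rough sketch under that assumption; the function names flagged_modules and fastqc_flags and the file name in the usage line are invented for illustration.

```shell
# Print the modules FastQC marked WARN or FAIL.
# Reads summary.txt content on stdin (tab-separated: status, module, file name).
flagged_modules() {
  awk -F'\t' '$1 == "WARN" || $1 == "FAIL" { print $1 ": " $2 }'
}

# Read summary.txt straight out of a FastQC zip without unpacking it.
# Assumption: unzip is installed and the zip has the usual <name>_fastqc/summary.txt layout.
fastqc_flags() {
  local zip="$1"
  unzip -p "$zip" '*/summary.txt' | flagged_modules
}

# Usage (hypothetical file name):
# fastqc_flags sample_R1_fastqc.zip
```

An empty result means every module passed. Warnings still deserve a look in the HTML report, since FastQC's thresholds were not tuned for RNA-seq specifically.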

Trimming with fastp

fastp is a newer tool that has become the default in many production pipelines. It is fast, feature-rich, and generates a useful HTML report.

Installation:

conda install -c conda-forge -c bioconda fastp

Basic command for paired-end reads:

fastp \
  -i sample_R1.fastq.gz \
  -I sample_R2.fastq.gz \
  -o trimmed_R1.fastq.gz \
  -O trimmed_R2.fastq.gz \
  --json fastp_report.json \
  --html fastp_report.html

Key parameters explained:

-i / -I: Input R1 and R2 FASTQ files (gzip is automatically detected).
-o / -O: Output trimmed R1 and R2 files.
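--detect_adapter_for_pe: Enables adapter sequence auto-detection for paired-end data. By default, fastp already trims paired-end adapters by analyzing the overlap between R1 and R2, so adapters are removed even without this flag. Adding it makes fastp also detect the adapter sequence itself, which can catch adapters that overlap analysis misses, at a small cost in speed. You can also specify adapters manually with --adapter_sequence and --adapter_sequence_r2. For RNA-seq, adding this flag is a reasonable default.
--qualified_quality_phred 20: Sets the Phred score a base needs to count as "qualified". The default is 15. This flag does not trim bases; it drives read filtering: fastp discards a read when too many of its bases fall below the threshold (more than 40% by default, set by --unqualified_percent_limit). Per-base quality trimming is off by default and is controlled by the separate --cut_front, --cut_tail, and --cut_right options. For RNA-seq, 20 is reasonable.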
--length_required 50: Discard reads shorter than 50 bases after trimming. The default is 15. For RNA-seq, 50 prevents spurious short alignments, but 30-40 is also acceptable.
--thread 8: Number of threads. Adjust based on your CPU. fastp is very fast, even with 2-4 threads.
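
If you want fastp to cut low-quality 3’ tails rather than only filter whole reads, its sliding-window options are a separate mechanism. The sketch below adds --cut_right with a 4-base window and a mean quality of 20 (the values are illustrative, not a recommendation from this post). The helper run_or_print, the function fastp_tail_trim, and the output naming are all hypothetical; with DRY_RUN=1 the command is printed instead of executed, so you can inspect it first.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run_or_print() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Paired-end fastp run with 3' sliding-window quality trimming.
# --cut_right scans from the 5' end and cuts the read at the first window
# whose mean quality is below the threshold.
fastp_tail_trim() {
  local sample="$1"   # e.g. "sample" for sample_R1.fastq.gz / sample_R2.fastq.gz
  run_or_print fastp \
    -i "${sample}_R1.fastq.gz" -I "${sample}_R2.fastq.gz" \
    -o "trimmed_${sample}_R1.fastq.gz" -O "trimmed_${sample}_R2.fastq.gz" \
    --cut_right --cut_window_size 4 --cut_mean_quality 20 \
    --length_required 50
}

# Preview the command without running fastp:
# DRY_RUN=1 fastp_tail_trim sample
```

Whether this is worth doing depends on the per-base quality plot; if quality holds up to the end of the read, the extra trimming only shortens reads.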

A realistic production command:

fastp \
  -i sample_R1.fastq.gz \
  -I sample_R2.fastq.gz \
  -o trimmed_R1.fastq.gz \
  -O trimmed_R2.fastq.gz \
  --detect_adapter_for_pe \
  --qualified_quality_phred 20 \
  --length_required 50 \
  --thread 8 \
  --json fastp_report.json \
  --html fastp_report.html

The HTML report shows before/after read length distributions, quality scores, and adapter removal stats. It is the easiest way to confirm trimming worked as expected.
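
The same numbers are available in the JSON report, which is handy for scripted checks. A minimal sketch, assuming jq is installed and that the report has the usual summary.before_filtering.total_reads and summary.after_filtering.total_reads fields; the function names pct_retained and fastp_retention are made up for this example.

```shell
# Percentage of reads retained, given read counts before and after filtering.
pct_retained() {
  awk -v b="$1" -v a="$2" 'BEGIN {
    if (b == 0) { print "NA"; exit }
    printf "%.1f\n", 100 * a / b
  }'
}

# Print: report file, reads before, reads after, percent retained (tab-separated).
# Assumption: jq is available on PATH.
fastp_retention() {
  local json="$1" before after
  before=$(jq '.summary.before_filtering.total_reads' "$json")
  after=$(jq '.summary.after_filtering.total_reads' "$json")
  printf '%s\t%s\t%s\t%s\n' "$json" "$before" "$after" "$(pct_retained "$before" "$after")"
}

# Usage:
# fastp_retention fastp_report.json
```

A sample that retains far fewer reads than its neighbors is worth inspecting before alignment, not after.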

Trimming with Trim Galore

Trim Galore is a wrapper around Cutadapt and FastQC. It is slightly slower than fastp but is more transparent about what it is doing and plays nicely with Cutadapt if you need fine-grained adapter control.

Installation:

conda install -c conda-forge -c bioconda trim-galore

Basic command for paired-end reads:

trim_galore \
  --paired \
  --quality 20 \
  --length 50 \
  --fastqc \
  --cores 8 \
  --output_dir trimmed/ \
  sample_R1.fastq.gz \
  sample_R2.fastq.gz

Key parameters explained:

--paired: Process R1 and R2 as a pair. Trim Galore will keep or discard both reads from a pair together, avoiding single-read orphans.
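--quality 20: Trim low-quality bases (Phred score below 20) from the 3’ end of reads. 20 is Trim Galore’s own default, so this is standard.
--length 50: Discard reads shorter than 50 bases after trimming. The default is 20. For paired-end data, the pair is removed if either read falls below this length.
--fastqc: Run FastQC on the trimmed reads automatically. Saves a step and ensures QC reports are consistent.
--cores 8: Number of cores to use for trimming. Trim Galore’s documentation suggests that returns diminish beyond about 4 cores, so a lower value is often enough.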
--output_dir trimmed/: Write results to a subdirectory.
Input files are positional arguments.

Output:

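Trim Galore writes files like these into the output directory:

sample_R1_val_1.fq.gz
sample_R2_val_2.fq.gz
sample_R1.fastq.gz_trimming_report.txt
sample_R2.fastq.gz_trimming_report.txt
sample_R1_val_1_fastqc.html and sample_R1_val_1_fastqc.zip (FastQC results; likewise for R2)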

The _val_1 and _val_2 suffixes are Trim Galore’s convention for trimmed paired-end reads.
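
Because paired-end aligners expect R1 and R2 to contain the same reads in the same order, a cheap sanity check after trimming is to confirm both output files hold the same number of reads. This is a sketch only: count_reads and pairs_in_sync are hypothetical helper names, and the check assumes four-line FASTQ records.

```shell
# Count reads in a gzipped FASTQ (assumes 4 lines per record).
count_reads() {
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Print both counts; exit status is 0 only if the two files have equal read counts.
pairs_in_sync() {
  local n1 n2
  n1=$(count_reads "$1")
  n2=$(count_reads "$2")
  echo "R1=${n1} R2=${n2}"
  [ "$n1" = "$n2" ]
}

# Usage with the output names shown above:
# pairs_in_sync trimmed/sample_R1_val_1.fq.gz trimmed/sample_R2_val_2.fq.gz
```

Equal counts do not prove the mates are in the same order, but unequal counts are a sure sign that something went wrong, for example trimming the two files separately without --paired.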

fastp vs. Trim Galore: Head-to-Head

Both tools do the job. The choice depends on your workflow preferences and what matters to you.

Dimension	fastp	Trim Galore
Speed	~3-5 minutes for 100M paired reads	~10-15 minutes (slower, wraps Cutadapt)
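Adapter detection	Automatic (R1/R2 overlap analysis; optional adapter sequence detection)	Automatic (checks the first 1 million reads for Illumina, Nextera, and small RNA adapters; manual override available)
Built-in QC report	Yes (JSON + HTML, very detailed)	Text trimming report; FastQC report when run with --fastqc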
Ease of use	Simpler for paired-end defaults	More transparent, more control
Development	Active (GitHub, regular updates)	Stable but less frequent updates
Learning curve	Low (sensible defaults)	Low (straightforward Cutadapt wrapper)

Winner for most RNA-seq labs: fastp. It is faster, has reasonable defaults, and its overlap-based adapter trimming handles paired-end data without needing to know the adapter sequence. The HTML report is also more immediately useful than running FastQC separately.

Winner if you need precision control: Trim Galore. If you are adapting trimming parameters for a specific protocol or have non-standard adapters, Cutadapt’s explicit control over adapter sequences is worth the speed penalty.

Running MultiQC After Trimming

If you are processing many samples, MultiQC aggregates all your trimming reports into a single browsable summary.

# Install MultiQC
conda install -c conda-forge -c bioconda multiqc

# Make sure the output directories exist before fastp writes to them
mkdir -p trimmed results

# Run fastp trimming on all samples (in a loop or batch script)
for file in *_R1.fastq.gz; do
  base=$(basename "$file" _R1.fastq.gz)
  fastp \
    -i "${base}_R1.fastq.gz" \
    -I "${base}_R2.fastq.gz" \
    -o trimmed/"${base}_R1.fastq.gz" \
    -O trimmed/"${base}_R2.fastq.gz" \
    --detect_adapter_for_pe \
    --qualified_quality_phred 20 \
    --length_required 50 \
    --thread 8 \
    --json results/"${base}.fastp.json" \
    --html results/"${base}.fastp.html"
done

# Aggregate reports
multiqc results/ -o multiqc_report/

Open multiqc_report/multiqc_report.html in your browser. You now have a side-by-side view of adapter removal, quality distributions, and trimming success across all samples.
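
A batch loop like the one above will fail partway through if one sample is missing its R2 file. A small pre-flight check can catch that before any trimming starts. This sketch assumes the _R1.fastq.gz / _R2.fastq.gz naming used in this post; mate_of and missing_mates are hypothetical helper names.

```shell
# Derive the R2 file name from an R1 file name.
# Assumption: files follow the <sample>_R1.fastq.gz / <sample>_R2.fastq.gz pattern.
mate_of() {
  local r1="$1"
  printf '%s\n' "${r1%_R1.fastq.gz}_R2.fastq.gz"
}

# List R1 files whose mate is missing; prints nothing when every pair is complete.
missing_mates() {
  local r1
  for r1 in "$@"; do
    [ -e "$(mate_of "$r1")" ] || echo "missing mate for ${r1}"
  done
}

# Usage:
# missing_mates *_R1.fastq.gz
```

Running this first costs almost nothing and avoids discovering an incomplete upload hours into a batch job.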

Common Mistakes

Over-trimming: Trimming aggressively to remove every low-quality base sounds safe but can remove legitimate signal, especially from 3’ counting methods. Do not trim more than necessary. If FastQC shows quality scores are fine, do not trim.

Trimming when you should not: Running fastp or Trim Galore takes compute time and produces new FASTQ files you have to manage. If your FastQC is clean and your aligner is robust, skipping trimming is faster and keeps your file count down. Many production pipelines skip this step entirely.

Forgetting to validate after trimming: Always run FastQC on your trimmed reads. If trimming removed so many reads that you have vastly fewer alignments later, you trimmed too hard. If trimming did nothing, you may not have needed it. The before/after comparison is your safeguard.

Using the wrong parameters for your method: Standard bulk RNA-seq tolerates different parameter choices than ATAC-seq or small RNA-seq. For example, a 50-base minimum length would discard nearly all small RNA reads once adapters are removed. If you are adapting parameters, know what your data looks like first (that is what FastQC is for).

Next Steps

You now have trimmed, quality-controlled FASTQ files ready for alignment. The natural next step is to align these reads to a reference genome.

For bulk RNA-seq, the standard approach is STAR for alignment followed by quantification with RSEM or featureCounts. See the tutorial How to Build a Bulk RNA-seq Pipeline: STAR, RSEM, DESeq2 for the next steps.

If you are building this pipeline in a container or workflow system, you will also want to review Docker and Singularity for Bioinformatics or How to Run Your First Nextflow Pipeline.