Short read quality evaluation and trimming¶
Getting started¶
Local installs and Web site¶
- make sure you are using the latest version of Google Chrome
- here is an etherpad link for copy/paste
Software we’ll be using on the server¶
- Trimmomatic
- FastQC
- khmer
- bowtie2
- a few custom scripts in the workshop repository
The Dockerfile shows the basic installation instructions for an Ubuntu machine.
A first test¶
Click here:
Once you’re at the console, go to New... Terminal... and copy/paste this text into it:
echo hello, world
If it doesn’t work, could you e-mail me with your browser and OS version, please?
If copy/paste doesn’t work for you, you can do the following:
open a new file in the editor, and name it ‘commands.sh’;
copy/paste stuff into commands.sh, clearing it each time;
to execute the commands, run:
. commands.sh
Next: Some background
Some background¶
Discussion of how Illumina sequencing works - base calling.
The Illumina machine encodes its quality information in the FASTQ format.
Quality scores are provided per base, in a log-10 scaled format. The likelihood of a base call being erroneous is 10 to the power of (-Q/10) - so for a Q value of 3, you get an error rate of 1/2, for a Q value of 30, you get an error rate of 0.001, and for a Q value of 40, you get an error rate of .0001.
Illumina typically uses a Q value of 2 to mean “ignore this base.”
It’s easy to know what to do with Ns and bad bases. But the challenge for us with Q scores is that they are statistical in nature - if we throw away every base with a Q value of 30, then 99.9% of the data we are discarding will be correct!
What do we use Illumina sequencing for?¶



In the end, the question is, what effect will losing real data have vs including spurious data, given the biological goal and the tools you have to reach it?
Basically, trimming tends to decrease sensitivity, while increasing specificity.
Basic Recommendations:¶
Always adapter trim! It doesn’t do any harm.
Impose a length filter to throw away short sequences - something like 50. This will eliminate newly-useless reads that might mismap.
- for quantification (RNAseq), trim lightly
- for RNAseq assembly, trim lightly
- for variant calling, trim stringently
...unless otherwise indicated, or if you have biological reasons otherwise.
The basic logic is that stringent trimming can remove lots of good sequence; in addition to removing actual errors.

In case it’s not clear, trimming (or not) can have real impact on your outcome –
See Questioning the evidence for non-canonical RNA editing in humans; and also Trimming RNAseq.
Evaluating samples with FastQC¶
Download some data¶
Let’s grab some E. coli genomic data from Chitsaz et al:
cd ~/notebooks
mkdir ecoli
cd ecoli
curl -u single-cell:SanDiegoCA http://bix.ucsd.edu/projects/singlecell/nbt_data/ecoli_ref.fastq.bz2 | \
bunzip2 -c | head -800000 | gzip -9c > ecoli-pe.fq.gz
This data comes already interleaved; let’s split it up.
split-paired-reads.py ecoli-pe.fq.gz
mv ecoli-pe.fq.gz.1 ecoli-R1.fq
mv ecoli-pe.fq.gz.2 ecoli-R2.fq
gzip ecoli-*.fq
You can take a look at the data files with ‘less’:
zless ecoli-R1.fq.gz
Next, grab some yeast mRNAseq data:
cd ~/notebooks
mkdir yeast
cd yeast
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR453/SRR453569/SRR453569_1.fastq.gz | \
gunzip -c | \
head -400000 | \
gzip -9c > yeast-rnaseq-R1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR453/SRR453569/SRR453569_2.fastq.gz | \
gunzip -c | \
head -400000 | \
gzip -9c > yeast-rnaseq-R2.fastq.gz
Running fastqc¶
You can run fastqc on all four of these files like so:
cd ~/notebooks/ecoli
fastqc ecoli-R1.fq.gz
fastqc ecoli-R2.fq.gz
cd ~/notebooks/yeast
fastqc yeast-rnaseq-R1.fastq.gz
fastqc yeast-rnaseq-R2.fastq.gz
To look at the report, you can download the resulting zip files through the console, unpack the zip file, and open the fastqc_report.html in a browser. You can also click on the fastqc report HTML file in the console, and then change the URL from ‘edit’ to ‘files’.
Observations:
- RNAseq has biases in its first 10 bp. This is not from errors but instead due to random priming.
- Sequence duplication levels may be “high” for high-coverage RNAseq (not shown here)
- All sequences have the same length pre-trim; this is how Illumina produces data, and it’s one reason why trimming is important.
- R1 is always better than R2.
Trimming reads with Trimmomatic¶
Note: the trimmomatic manual is the go-to-guide for parameters and commands.
E. coli trimming¶
Trim the E. coli data lightly:
cd ~/notebooks
cd ecoli
java -jar /home/Trimmomatic-0.36/trimmomatic-0.36.jar PE \
ecoli-R1.fq.gz ecoli-R2.fq.gz \
ecoli-R1-pe.fq ecoli-R1-orphans.fq ecoli-R2-pe.fq ecoli-R2-orphans.fq \
ILLUMINACLIP:/home/Trimmomatic-0.36/adapters/TruSeq3-PE.fa:2:40:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 \
MINLEN:50
You should see:
Input Read Pairs: 100000 Both Surviving: 99802 (99.80%) Forward Only Surviving: 186 (0.19%) Reverse Only Surviving: 12 (0.01%) Dropped: 0 (0.00%)
Trim the E. coli data stringently:
java -jar /home/Trimmomatic-0.36/trimmomatic-0.36.jar PE \
ecoli-R1.fq.gz ecoli-R2.fq.gz \
ecoli-R1-pe.fq ecoli-R1-orphans.fq ecoli-R2-pe.fq ecoli-R2-orphans.fq \
ILLUMINACLIP:/home/Trimmomatic-0.36/adapters/TruSeq3-PE.fa:2:40:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:20 \
MINLEN:50
You should see:
Input Read Pairs: 100000 Both Surviving: 83439 (83.44%) Forward Only Surviving: 8960 (8.96%) Reverse Only Surviving: 4821 (4.82%) Dropped: 2780 (2.78%)
You can use khmer’s readstats.py to evaluate the loss of sequence –
readstats.py ecoli-R?.fq.gz
readstats.py ecoli-R?-*.fq
Now, run fastqc on the trimmed data:
fastqc ecoli-R1-pe.fq
fastqc ecoli-R2-pe.fq
Trimming the yeast RNAseq data¶
Next, try building your own commands to trim the yeast RNAseq data - stringently, and lightly.
Next: Evaluating read mismatch statistics: mapping and de novo
Evaluating read mismatch statistics: mapping and de novo¶
Looking at read mismatch profiles by mapping to a reference¶
One of my favorite approaches to looking at error bias in read data sets is not to just use the Q score, but to do something empirical instead – for example, if you already have a reference, you can map the reads to the reference and look at mapping mismatches.
Do a mapping with Bowtie2:
cd ~/notebooks/ecoli
curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoliMG1655.fa.gz
gunzip ecoliMG1655.fa.gz
bowtie2-build ecoliMG1655.fa ecoli
gunzip -c ecoli-R1.fq.gz | bowtie2 -p 4 -x ecoli -U - -S ecoli-R1.sam
then use a custom script to look at where in the read mapping there are mismatches:
~/notebooks/sam-scan-errhist.py ecoliMG1655.fa ecoli-R1.sam -o ecoli-mapped.errhist
Looking at the read mismatch profile de novo¶







To try this out with khmer, do:
cd ~/notebooks/ecoli
load-into-counting.py -M 1e8 -k 21 ecoli.kh ecoli-R1.fq.gz
~/notebooks/report-errhist-2pass.py -C 1 ecoli.kh ecoli-R1.fq.gz > ecoli-kmer.errhist
(Note, you don’t want to do k-mer abundance trimming in general, but it’s useful for evaluation.)
Look at the errhist files using ‘less’ or ‘tail’... Or plot them.
Can we do this for RNAseq? (Use the -V option on report-errhist-2pass).
Final thoughts on short-read trimming¶
A summary of recommendations¶
- Run FastQC before trimming; trim; look again.
- Always trim paired sequences together.
- Always adapter trim!
- Impose a length filter, 50.
- for quantification (RNAseq), trim lightly
- for RNAseq assembly, trim lightly
- for variant calling, trim stringently
- use the same trimming parameters on all your data unless you have a VERY good reason otherwise!
- ignore the first 10 bp composition bias in RNAseq;
- ignore sequence duplication levels in high-coverage RNAseq;
- look at your read positional bias with mapping (or de novo) as well;
Some references¶
MacManes, 2014, http://journal.frontiersin.org/article/10.3389/fgene.2014.00013/full - recommends gentle trimming for RNAseq.
Williams et al., 2015, http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0956-2 - recommends imposing a length filter.
Mbandi et al., 2014, http://journal.frontiersin.org/article/10.3389/fgene.2014.00017/full - complicated, but start with gentle trimming.
Del Fabbro et al., 2013, http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0085024 - evaluation across data sets.
Indices and tables¶
LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.