Evaluating samples with FastQC¶

Download some data¶

Let’s grab some E. coli genomic data from Chitsaz et al:

cd ~/notebooks
mkdir ecoli
cd ecoli

curl -u single-cell:SanDiegoCA http://bix.ucsd.edu/projects/singlecell/nbt_data/ecoli_ref.fastq.bz2 | \
  bunzip2 -c | head -800000 | gzip -9c > ecoli-pe.fq.gz

This data comes already interleaved; let’s split it up.

split-paired-reads.py ecoli-pe.fq.gz
mv ecoli-pe.fq.gz.1 ecoli-R1.fq
mv ecoli-pe.fq.gz.2 ecoli-R2.fq
gzip ecoli-*.fq

You can take a look at the data files with ‘less’:

zless ecoli-R1.fq.gz

Next, grab some yeast mRNAseq data:

cd ~/notebooks
mkdir yeast
cd yeast

curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR453/SRR453569/SRR453569_1.fastq.gz | \
  gunzip -c | \
  head -400000 | \
  gzip -9c > yeast-rnaseq-R1.fastq.gz

curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR453/SRR453569/SRR453569_2.fastq.gz | \
  gunzip -c | \
  head -400000 | \
  gzip -9c > yeast-rnaseq-R2.fastq.gz

Running fastqc¶

You can run fastqc on all four of these files like so:

cd ~/notebooks/ecoli
fastqc ecoli-R1.fq.gz
fastqc ecoli-R2.fq.gz

cd ~/notebooks/yeast
fastqc yeast-rnaseq-R1.fastq.gz
fastqc yeast-rnaseq-R2.fastq.gz

To look at the report, you can download the resulting zip files through the console, unpack the zip file, and open the fastqc_report.html in a browser. You can also click on the fastqc report HTML file in the console, and then change the URL from ‘edit’ to ‘files’.

Observations:

RNAseq has biases in its first 10 bp. This is not from errors but instead due to random priming.
Sequence duplication levels may be “high” for high-coverage RNAseq (not shown here)
All sequences have the same length pre-trim; this is how Illumina produces data, and it’s one reason why trimming is important.
R1 is always better than R2.

Next: Trimming reads with Trimmomatic

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.