Evaluating read mismatch statistics: mapping and de novo

Looking at read mismatch profiles by mapping to a reference

One of my favorite approaches to looking at error bias in read data sets is not to just use the Q score, but to do something empirical instead – for example, if you already have a reference, you can map the reads to the reference and look at mapping mismatches.

Do a mapping with Bowtie2:

cd ~/notebooks/ecoli
curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoliMG1655.fa.gz
gunzip ecoliMG1655.fa.gz

bowtie2-build ecoliMG1655.fa ecoli
gunzip -c ecoli-R1.fq.gz | bowtie2 -p 4 -x ecoli -U - -S ecoli-R1.sam

then use a custom script to look at where in the read mapping there are mismatches:

~/notebooks/sam-scan-errhist.py ecoliMG1655.fa ecoli-R1.sam -o ecoli-mapped.errhist

Looking at the read mismatch profile de novo

To try this out with khmer, do:

cd ~/notebooks/ecoli

load-into-counting.py -M 1e8 -k 21 ecoli.kh ecoli-R1.fq.gz
~/notebooks/report-errhist-2pass.py -C 1 ecoli.kh ecoli-R1.fq.gz > ecoli-kmer.errhist

(Note, you don’t want to do k-mer abundance trimming in general, but it’s useful for evaluation.)

Look at the errhist files using ‘less’ or ‘tail’... Or plot them.

Can we do this for RNAseq? (Use the -V option on report-errhist-2pass).

Next: Final thoughts on short-read trimming


LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.