Monday, 29 September 2014

Why you should QC your reads AND your assembly

The genome sequence of the Common Carp Cyprinus carpio was published in Nature last week. By coincidence, I was doing some QC on some domesticated Ferret (Mustela putorius furo) reads, which had thrown some kmer warnings in the FastQC tool. I blasted the kmers at NCBI and was quite perplexed by the number of hits that I found in the carp genome: nearly all of the first 150 hits were from it. Anyway, I looked a bit further into my odd kmers and it turns out that they were the ends of some Illumina adapter sequences that had presumably been incorporated into the paired reads from fragments at the shorter end of the insert-size distribution. This then took me back to the Carp genome - what had crept into that?

In the paper, the authors state that they used 454, Illumina and SOLiD sequencing and also used some previously published BAC-end sequences. The BAC-end and 454 sequences were assembled with the Celera assembler, and the Illumina, SOLiD and 454 8kb mate-pair sequences were mapped to the assembly to construct the scaffolds. Finally, they used the paired-end information from the short paired-end reads to fill the gaps between the scaffolds. The final assembly consists of 9377 scaffolds.

The only quality control they speak of is "We then filtered out low-quality and short reads to obtain a set of usable reads".

So I thought I'd look at what was actually in their assembly. I downloaded the Carp genome assembly (9377 scaffolds) and created a blast database from it. I then created a fasta file of Illumina adapter sequences (found here) and used them as query sequences to blast against the Carp genome. There is some redundancy in the Illumina adapter sequences, so I collapsed them, retaining only unique sequences, and then removed any adapter sequences that were sub-sequences of a longer adapter (the final file consisted of 81 sequences). The blast resulted in 3750 hits (evalue < 8.00E-06), of which 1009 were of 100% identity.
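For anyone wanting to reproduce the collapsing step, here is a minimal Python sketch of the idea described above. It is not the script I actually ran; the function names `read_fasta` and `collapse_adapters` are made up for illustration, and it assumes that "sub-sequence" means an exact substring on the same strand (reverse complements are not considered).

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def collapse_adapters(records):
    """Keep one record per unique sequence, then drop any sequence that is
    an exact substring of a longer retained sequence (assumption: same
    strand only)."""
    unique = {}
    for name, seq in records:
        unique.setdefault(seq, name)  # first name wins for duplicate sequences
    kept = []
    # Longest first, so every candidate is compared only against sequences
    # at least as long as itself.
    for seq in sorted(unique, key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return [(unique[seq], seq) for seq in kept]
```

Calling `collapse_adapters(read_fasta(open("adapters.fa")))` (with a hypothetical file name) returns the non-redundant records, which can then be written back out as the query fasta for the blast.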

This gave me a final tally of at least 20 Illumina adapter sequences incorporated into the final Common Carp genome assembly. Out of the 9377 scaffolds, 277 appear to have Illumina adapter sequences in them. I've included the counts of the different Illumina adapter sequences (non-redundant) for the scaffolds at the bottom of the page.
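The per-scaffold tally can be sketched along these lines. This is an illustration rather than my original script: it assumes the blast results were saved in the standard 12-column tabular format (blastn `-outfmt 6`, where column 1 is the query, column 2 the subject, column 3 the percent identity and column 11 the evalue), and the function name `count_adapters_per_scaffold` is hypothetical.

```python
from collections import Counter


def count_adapters_per_scaffold(blast_lines, max_evalue=8e-6, min_identity=0.0):
    """Count how many distinct adapter (query) sequences hit each scaffold
    (subject) in tabular BLAST output, after evalue/identity filtering."""
    seen = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity, evalue = float(fields[2]), float(fields[10])
        if evalue < max_evalue and identity >= min_identity:
            # A set, so repeated hits of one adapter on one scaffold count once
            seen.add((subject, query))
    return Counter(subject for subject, _ in seen)
```

`count_adapters_per_scaffold(open("blast.tsv")).most_common()` (hypothetical file name) then gives the scaffolds sorted by adapter count; passing `min_identity=100.0` restricts the tally to perfect matches.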

I've not looked for adapter sequences used in SOLiD or 454 sequencing yet. It would be interesting to see what that throws up.

So, a lesson to be learned here. QC your assembly, especially if you're not overly stringent with your read QC.

Here's the data:
Common Carp genome scaffolds
Illumina adapter sequences
Illumina adapter sequences collapsed
Illumina adapters v Carp genome blast

No. of adapter sequences per scaffold (scaffold name, adapter count):
gi|685042180|emb|LN590705.1|    17
gi|685042192|emb|LN590717.1|    16
gi|685042440|emb|LN590965.1|    16
gi|685042175|emb|LN590700.1|    16
gi|685042188|emb|LN590713.1|    16
gi|685042151|emb|LN590676.1|    15
gi|685042147|emb|LN590672.1|    15
gi|685042187|emb|LN590712.1|    15
gi|685042160|emb|LN590685.1|    15
gi|685042155|emb|LN590680.1|    15
gi|685042173|emb|LN590698.1|    15
gi|685042162|emb|LN590687.1|    15
gi|685042168|emb|LN590693.1|    15
gi|685042169|emb|LN590694.1|    15
gi|685042181|emb|LN590706.1|    15
gi|685046009|emb|LN594534.1|    15
gi|685047710|emb|LN596222.1|    15
gi|685045925|emb|LN594450.1|    15
gi|685042309|emb|LN590834.1|    15
gi|685042195|emb|LN590720.1|    15
gi|685042524|emb|LN591049.1|    15
gi|685042157|emb|LN590682.1|    15
gi|685042158|emb|LN590683.1|    15
gi|685042443|emb|LN590968.1|    15
gi|685042186|emb|LN590711.1|    15
gi|685042159|emb|LN590684.1|    15
gi|685042165|emb|LN590690.1|    15
gi|685042189|emb|LN590714.1|    15
gi|685042166|emb|LN590691.1|    15
gi|685042177|emb|LN590702.1|    15
gi|685042171|emb|LN590696.1|    15
gi|685042388|emb|LN590913.1|    15
gi|685042184|emb|LN590709.1|    15
gi|685042163|emb|LN590688.1|    15
gi|685042306|emb|LN590831.1|    15
gi|685042176|emb|LN590701.1|    15
gi|685042161|emb|LN590686.1|    15
gi|685042153|emb|LN590678.1|    15
gi|685042167|emb|LN590692.1|    15
gi|685042172|emb|LN590697.1|    15
gi|685042152|emb|LN590677.1|    15
gi|685042148|emb|LN590673.1|    15
gi|685042164|emb|LN590689.1|    15
gi|685042156|emb|LN590681.1|    15
gi|685042191|emb|LN590716.1|    14
gi|685047575|emb|LN596087.1|    14
gi|685042149|emb|LN590674.1|    14
gi|685046121|emb|LN594646.1|    14
gi|685042178|emb|LN590703.1|    14
gi|685042179|emb|LN590704.1|    14
gi|685042146|emb|LN590671.1|    14
gi|685046718|emb|LN595243.1|    14
gi|685042530|emb|LN591055.1|    14
gi|685042206|emb|LN590731.1|    14
gi|685043277|emb|LN591802.1|    14
gi|685048916|emb|LN597428.1|    13
gi|685047677|emb|LN596189.1|    13
gi|685047040|emb|LN595565.1|    13
gi|685045865|emb|LN594390.1|    13
gi|685042663|emb|LN591188.1|    13
gi|685044585|emb|LN593110.1|    13
gi|685047827|emb|LN596339.1|    13
gi|685042350|emb|LN590875.1|    13
gi|685049067|emb|LN597579.1|    13
gi|685044342|emb|LN592867.1|    13
gi|685049565|emb|LN598077.1|    13
gi|685045018|emb|LN593543.1|    13
gi|685045942|emb|LN594467.1|    13
gi|685042445|emb|LN590970.1|    13
gi|685049785|emb|LN598297.1|    13
gi|685049099|emb|LN597611.1|    13
gi|685046114|emb|LN594639.1|    13
gi|685047586|emb|LN596098.1|    13
gi|685042174|emb|LN590699.1|    13
gi|685045086|emb|LN593611.1|    13
gi|685043017|emb|LN591542.1|    13
gi|685049950|emb|LN598462.1|    13
gi|685042883|emb|LN591408.1|    13
gi|685046299|emb|LN594824.1|    13
gi|685046059|emb|LN594584.1|    13
gi|685046522|emb|LN595047.1|    13
gi|685042277|emb|LN590802.1|    13
gi|685042502|emb|LN591027.1|    13
gi|685042566|emb|LN591091.1|    13
gi|685047850|emb|LN596362.1|    13
gi|685049978|emb|LN598497.1|    13
gi|685045708|emb|LN594233.1|    13
gi|685050802|emb|LN599314.1|    13
gi|685042573|emb|LN591098.1|    13
gi|685046758|emb|LN595283.1|    13
gi|685042781|emb|LN591306.1|    13
gi|685042481|emb|LN591006.1|    13
gi|685042234|emb|LN590759.1|    13
gi|685050149|emb|LN598661.1|    13
gi|685042332|emb|LN590857.1|    13
gi|685049842|emb|LN598354.1|    13
gi|685047990|emb|LN596502.1|    13
gi|685050795|emb|LN599307.1|    13
gi|685045935|emb|LN594460.1|    13
gi|685042170|emb|LN590695.1|    13
gi|685046234|emb|LN594759.1|    13
gi|685048729|emb|LN597241.1|    13
gi|685051416|emb|LN599928.1|    13
gi|685042334|emb|LN590859.1|    13
gi|685043813|emb|LN592338.1|    13
gi|685048940|emb|LN597452.1|    13
gi|685046795|emb|LN595320.1|    13
gi|685042278|emb|LN590803.1|    13
gi|685047058|emb|LN595583.1|    13
gi|685046063|emb|LN594588.1|    13
gi|685042802|emb|LN591327.1|    13
gi|685042242|emb|LN590767.1|    13
gi|685046967|emb|LN595492.1|    13
gi|685045725|emb|LN594250.1|    13
gi|685044439|emb|LN592964.1|    13
gi|685043936|emb|LN592461.1|    13
gi|685043992|emb|LN592517.1|    13
gi|685045281|emb|LN593806.1|    13
gi|685042185|emb|LN590710.1|    13
gi|685042190|emb|LN590715.1|    13
gi|685042383|emb|LN590908.1|    13
gi|685042494|emb|LN591019.1|    13
gi|685044709|emb|LN593234.1|    13
gi|685042470|emb|LN590995.1|    13
gi|685042377|emb|LN590902.1|    13
gi|685044437|emb|LN592962.1|    13
gi|685044971|emb|LN593496.1|    13
gi|685042304|emb|LN590829.1|    13
gi|685050005|emb|LN598517.1|    13
gi|685047355|emb|LN595867.1|    13
gi|685042460|emb|LN590985.1|    13
gi|685042690|emb|LN591215.1|    13
gi|685049916|emb|LN598428.1|    13
gi|685042409|emb|LN590934.1|    13
gi|685045157|emb|LN593682.1|    13
gi|685045547|emb|LN594072.1|    13
gi|685042545|emb|LN591070.1|    13
gi|685045322|emb|LN593847.1|    13
gi|685046213|emb|LN594738.1|    13
gi|685042640|emb|LN591165.1|    13
gi|685042774|emb|LN591299.1|    13
gi|685042247|emb|LN590772.1|    13
gi|685042281|emb|LN590806.1|    13
gi|685048206|emb|LN596718.1|    13
gi|685042314|emb|LN590839.1|    13
gi|685042193|emb|LN590718.1|    13
gi|685042236|emb|LN590761.1|    13
gi|685042194|emb|LN590719.1|    13
gi|685043901|emb|LN592426.1|    13
gi|685047157|emb|LN595682.1|    13
gi|685049794|emb|LN598306.1|    13
gi|685043829|emb|LN592354.1|    3
gi|685049413|emb|LN597925.1|    2
gi|685042389|emb|LN590914.1|    2
gi|685048207|emb|LN596719.1|    2
gi|685042244|emb|LN590769.1|    2
gi|685042986|emb|LN591511.1|    2
gi|685049771|emb|LN598283.1|    2
gi|685042593|emb|LN591118.1|    2
gi|685048211|emb|LN596723.1|    2
gi|685042612|emb|LN591137.1|    2
gi|685046950|emb|LN595475.1|    2
gi|685048221|emb|LN596733.1|    2
gi|685047075|emb|LN595600.1|    2
gi|685042674|emb|LN591199.1|    2
gi|685044336|emb|LN592861.1|    2
gi|685042390|emb|LN590915.1|    2
gi|685044811|emb|LN593336.1|    2
gi|685042505|emb|LN591030.1|    2
gi|685042980|emb|LN591505.1|    2
gi|685045621|emb|LN594146.1|    2
gi|685043038|emb|LN591563.1|    2
gi|685046462|emb|LN594987.1|    2
gi|685046214|emb|LN594739.1|    2
gi|685047284|emb|LN595796.1|    2
gi|685042358|emb|LN590883.1|    2
gi|685048215|emb|LN596727.1|    2
gi|685047993|emb|LN596505.1|    2
gi|685046217|emb|LN594742.1|    2
gi|685047762|emb|LN596274.1|    2
gi|685044387|emb|LN592912.1|    2
gi|685046246|emb|LN594771.1|    2
gi|685042182|emb|LN590707.1|    2
gi|685046578|emb|LN595103.1|    2
gi|685046705|emb|LN595230.1|    2
gi|685042359|emb|LN590884.1|    2
gi|685043150|emb|LN591675.1|    2
gi|685043298|emb|LN591823.1|    2
gi|685043842|emb|LN592367.1|    2
gi|685044446|emb|LN592971.1|    2
gi|685044108|emb|LN592633.1|    2
gi|685045467|emb|LN593992.1|    2
gi|685046126|emb|LN594651.1|    2
gi|685044178|emb|LN592703.1|    2
gi|685048451|emb|LN596963.1|    2
gi|685049697|emb|LN598209.1|    2
gi|685045317|emb|LN593842.1|    2
gi|685042643|emb|LN591168.1|    2
gi|685050038|emb|LN598550.1|    2
gi|685046698|emb|LN595223.1|    2
gi|685049624|emb|LN598136.1|    2
gi|685042248|emb|LN590773.1|    2
gi|685043270|emb|LN591795.1|    2
gi|685042424|emb|LN590949.1|    2
gi|685044859|emb|LN593384.1|    2
gi|685046260|emb|LN594785.1|    2
gi|685042219|emb|LN590744.1|    2
gi|685042453|emb|LN590978.1|    2
gi|685051535|emb|LN600047.1|    2
gi|685049977|emb|LN598496.1|    2
gi|685042531|emb|LN591056.1|    2
gi|685042462|emb|LN590987.1|    2
gi|685046396|emb|LN594921.1|    2
gi|685047630|emb|LN596142.1|    2
gi|685044174|emb|LN592699.1|    2
gi|685044053|emb|LN592578.1|    2
gi|685043877|emb|LN592402.1|    2
gi|685042382|emb|LN590907.1|    2
gi|685045195|emb|LN593720.1|    2
gi|685042326|emb|LN590851.1|    2
gi|685042456|emb|LN590981.1|    2
gi|685049801|emb|LN598313.1|    2
gi|685043683|emb|LN592208.1|    2
gi|685042720|emb|LN591245.1|    2
gi|685047580|emb|LN596092.1|    2
gi|685042328|emb|LN590853.1|    2
gi|685049277|emb|LN597789.1|    2
gi|685042687|emb|LN591212.1|    2
gi|685046308|emb|LN594833.1|    2
gi|685042611|emb|LN591136.1|    2
gi|685042261|emb|LN590786.1|    2
gi|685044865|emb|LN593390.1|    2
gi|685045958|emb|LN594483.1|    2
gi|685043564|emb|LN592089.1|    2
gi|685042405|emb|LN590930.1|    2
gi|685044773|emb|LN593298.1|    2
gi|685044246|emb|LN592771.1|    2
gi|685048094|emb|LN596606.1|    2
gi|685045977|emb|LN594502.1|    2
gi|685042364|emb|LN590889.1|    2
gi|685045502|emb|LN594027.1|    2
gi|685042526|emb|LN591051.1|    2
gi|685042275|emb|LN590800.1|    2
gi|685049991|emb|LN598471.1|    2
gi|685048310|emb|LN596822.1|    2
gi|685042496|emb|LN591021.1|    2
gi|685042356|emb|LN590881.1|    2
gi|685042693|emb|LN591218.1|    2
gi|685042150|emb|LN590675.1|    2
gi|685046587|emb|LN595112.1|    2
gi|685047815|emb|LN596327.1|    2
gi|685043256|emb|LN591781.1|    2
gi|685047448|emb|LN595960.1|    2
gi|685045849|emb|LN594374.1|    2
gi|685046646|emb|LN595171.1|    2
gi|685045330|emb|LN593855.1|    2
gi|685044107|emb|LN592632.1|    2
gi|685046334|emb|LN594859.1|    2
gi|685042736|emb|LN591261.1|    2
gi|685043005|emb|LN591530.1|    1
gi|685042183|emb|LN590708.1|    1
gi|685047691|emb|LN596203.1|    1
gi|685042610|emb|LN591135.1|    1
gi|685042780|emb|LN591305.1|    1
gi|685042473|emb|LN590998.1|    1
gi|685042253|emb|LN590778.1|    1
gi|685042641|emb|LN591166.1|    1
gi|685043940|emb|LN592465.1|    1
gi|685044505|emb|LN593030.1|    1
gi|685047500|emb|LN596012.1|    1
gi|685042228|emb|LN590753.1|    1
gi|685043265|emb|LN591790.1|    1
gi|685048142|emb|LN596654.1|    1
gi|685044430|emb|LN592955.1|    1
gi|685042450|emb|LN590975.1|    1
gi|685042461|emb|LN590986.1|    1
gi|685042606|emb|LN591131.1|    1


  1. Great post-publication review. Would you be able to share the details of the steps you followed, such as for collapsing, creating a blast database and the querying done against it? Scripts would be great since it is all about reanalysis and reproducibility.

  2. Huh. How did this one sneak by NCBI curators? They usually run vecscreen on these, which should catch any Illumina adapter (provided they are in UniVec).

    1. NCBI does catch these and removes/trims them, reporting to submitters. If a primer is in the middle of a contig, the submitter is asked for re-assembly or removal. This genome was submitted to EBI. I'm not sure about EBI-specific policy or submission screening procedures, but the submitter/owner of any sequence in an INSDC database is the ultimate authority on its content and is welcome to provide updates or corrections.

  3. This comment has been removed by the author.

  4. Great QC work! Have you got any feedback from Nature on this?

  5. I'm facing the same problem with my assembled genomes. So I wrote a script to remove the reads that showed hits against Illumina adapters in a blastn analysis. But when I ran blastn again (now on the set of reads without the sequences flagged by the previous blastn), I found a new set of reads (different ones) that were not found in the previous blastn but now show hits against adapters. Is this happening to anyone else? Graham, did you notice something like this in your analysis?

    1. Hi Erica,
      I've not downloaded and looked at the reads. One way to do this is to look at the output of 'Overrepresented sequences' in the fastqc tool and then remove those sequences, e.g.

  6. I was befuddled as to why random bits of my Illicium plastome were matching with Carp scaffolds in GenBank. This shed a lot of light on it. Thanks!
