De novo assembly of the complex genome of Nippostrongylus brasiliensis using MinION long reads
Received: 12 September 2017
Accepted: 14 December 2017
Published: 11 January 2018
Abstract
Background
Eukaryotic genome assembly remains a challenge in part due to the prevalence of complex DNA repeats. This is a particularly acute problem for holocentric nematodes because of the large number of satellite DNA sequences found throughout their genomes. These have been recalcitrant to most genome sequencing methods. At the same time, many nematodes are parasites and some represent a serious threat to human health. There is a pressing need for better molecular characterization of animal and plant parasitic nematodes. The advent of long-read DNA sequencing methods offers the promise of resolving complex genomes.
Results
Using Nippostrongylus brasiliensis as a test case, applying improved base-calling algorithms and assembly methods, we demonstrate the feasibility of de novo genome assembly matching current community standards using only MinION long reads. In doing so, we uncovered an unexpected diversity of very long and complex DNA sequences repeated throughout the N. brasiliensis genome, including massive tandem repeats of tRNA genes.
Conclusion
Base-calling and assembly methods have improved sufficiently that de novo genome assembly of large complex genomes is possible using only long reads. The method has the added advantage of preserving haplotypic variants and so has the potential to be used in population analyses.
Keywords
Background
Human hookworm infections by the parasitic nematodes Necator americanus and Ancylostoma duodenale continue to be a major global health problem. Next-generation sequencing (NGS) techniques open the door to molecular epidemiological monitoring of nematode and helminth parasites in endemic areas. Such studies are, however, hampered by the heterogeneous nature of parasite populations and by the intrinsically complex genome structures of nematodes [1]. In contrast to most NGS machines, which are cumbersome and can be operated only within a laboratory setting, Oxford Nanopore Technology (ONT) MinION sequencers are small and highly portable. They are robust and are now being used all over the globe, even in extreme environments [2, 3]. That they are well suited for field studies, and their capacity to generate long-read sequences, makes them potentially an ideal tool for conducting molecular epidemiological studies of nematode and helminth parasites in remote locations.
Long-read sequences are especially useful for de novo genome assembly. Reads from the MinION, as well as from PacBio’s single-molecule real-time (SMRT) sequencing platform, however, suffer from an error rate that is very substantially higher than seen with short-read NGS technologies. As a consequence, de novo genome assembly based on long DNA reads often relies on hybrid strategies incorporating, for example, short-read DNA sequencing [4–7], although non-hybrid long-read only methods do exist [8–12] (see [13] for a review). Indeed, there have now been successful chromosome-scale assemblies of large genomes, primarily using the PacBio SMRT platform to generate contigs (e.g. [14]), combined with long-range linking information (e.g. [15, 16]) but the approach can still be challenging.
Life cycle of Nippostrongylus brasiliensis in the laboratory. Parasite stages are not drawn to scale. Adapted with permission from Camberis et al. [35]
Results and discussion
Genome and transcriptome contiguity scores. Statistics for the current WTSI reference sequence are given for comparison
Description | Size (kb) | N50 (kb) | N90 (kb) | Minimum (kb) | Maximum (kb) |
---|---|---|---|---|---|
[Contigs/genes] | [L50] | [L90] | |||
WTSI reference | 294,400 | 33.5 | 4.3 | 0.5 | 394.2 |
[29,375] | [2024] | [11,638] | |||
Uncorrected (Canu only)* | 347,186 | 209.2 | 38.9 | 1.7 | 2048.3 |
[3583] | [415] | [1964] | |||
Trinity [de novo] | 180,448 | 0.8 | 0.3 | 0.2 | 21.5 |
[291,671/172,480]** | [53,613] | [219,557] | |||
Trinity [genome-guided] | 249,065 | 1.2 | 0.3 | 0.2 | 22.0 |
[352,994/236,865]** | [50569] | [245,765] | |||
Trinity [expression-filtered] | 71,457 | 2.0 | 0.7 | 0.2 | 22.0 |
[52,302] | [11,072] | [34,924] | |||
WGA, Canu only | 119,196 | 53.1 | 17.3 | 1.8 | 388.2 |
[3280] | [647] | [2199] |
Effect of extraction and analysis methods on the size of DNA reads. The distribution of read lengths for unamplified DNA extracted by each one of four methods and using two analysis methods is shown. The insert is a cumulative plot of the same data
Yields of sequences using different extraction and analysis methods. The percentage of the raw reads that were included in the output of the two methods (MinKNOW and Canu v1.4, or Albacore and Canu v1.5) is shown together with the total sequence lengths and the number of reads exceeding the indicated lengths. See ‘Methods’ for experimental details and Additional file 1: Table S1 for more information
Yield | |||||||||
---|---|---|---|---|---|---|---|---|---|
Preparation Method | Raw reads | MinKNOW | Albacore | ||||||
% | Gb | >50 kb | >100 kb | % | Gb | >50 kb | >100 kb | ||
1. Sonication, DNeasy | 930,957 | 99.9 | 2.375 | 80 | 15 | 99.1 | 2.603 | 472 | 82 |
2. Sonication, G2, Tip20 | 1,845,520 | 85.2 | 2.289 | 2252 | 17 | 98.0 | 3.002 | 4888 | 1006 |
3. Direct lysis, DNeasy | 451,578 | 99.9 | 1.723 | 179 | 26 | 98.8 | 2.028 | 1710 | 775 |
4. Pipette squashed, G2, Tip20 | 2,333,064 | 83.9 | 3.834 | 10 | 0 | 98.4 | 4.678 | 427 | 89 |
Improved sequence fidelity using the updated base caller. a Alignment of a single 74-kb read against the corresponding scaffold of the current WTSI reference genome, plotted using Kablammo [36]. b Identification of a very long stretch of complex tandem repeats (21 kb with a 171-bp repeat unit) within the same read. WTSI Wellcome Trust Sanger Institute
Improved sequence fidelity using updated methods. Homopolymeric regions that were compressed in the original MinKNOW-called sequence (top) were expanded when the base caller Albacore included a transducer mode that incorporates signal length into the called sequence (bottom). The consensus sequence from the final assembly (from contig tig00023109) is shown together with the same set of individual reads, called with MinKNOW (top) or Albacore I (bottom). Sanger sequencing of PCR amplicons confirmed the accuracy of the newer base-calling, specifically the stretch of nine A’s, marked by the black bar, which was not supported by a single read in the MinKNOW data. Some of the sequence variation here may reflect genuine genotypic variation and not simply sequencing errors
Improved assembly using updated methods. Distribution of fragment sizes and complexity for contigs assembled using old (MinKNOW/Canu v1.4; a) and new (Albacore/Canu v1.5; b) methods with unamplified DNA. The structures of the assembled contigs (drawn to scale, as indicated) are also shown. The unbranched subgraphs representing more than 70% of the sequence (dark magenta bars) have been omitted. The distribution of fragment sizes from the WTSI reference genome (dashed dotted line) is shown in (b). The largest two subgraphs in (a) contain 207 contigs (total of 28.7 Mb) and 101 contigs (17.3 Mb) and in (b) 106 contigs (13.3 Mb) and 76 contigs (10.6 Mb). In addition to the better resolution of subgraphs, in the new assembly, there is a marked increase in unbranched subgraphs longer than 1 Mb. This improvement in contiguity reflects the increase in the quality and quantity of called reads, and better assembly algorithms. WTSI Wellcome Trust Sanger Institute
Analysis of repeat sequences within different assemblies. SATFIND [23] was used on contigs >2.5 kb. The total length of each region of a repeated DNA sequence is plotted against the repeat’s unit length, for the WTSI genome assembly (a) and our final assembly (b). The orange and red lines are at 150 bp (typical maximum read length for an Illumina HiSeq run) and 650 bp, respectively. Any VeCTRs with a unit length longer than 150 bp would not be identifiable as a repetitive sequence on an Illumina sequencer. Any VeCTRs with a region length longer than the maximum target fragment length for Illumina sequencing (typically 650 bp) will be collapsed into a shorter region if only non-mate-paired 150 bp reads are used for the assembly. c Alignment of the region corresponding to the longest repeat unit length in (b), with each base represented as a colored line. Although the resolution of repeats is greatly improved compared to the WTSI assembly, such sequences still present a challenge for assembly, as evidenced by the fact that the 5' end of this sequence corresponds to a contig end (tig00023164; coordinates on the left). The color code is as in Fig. 4. Very long stretch of a complex tandem repeat (VeCTR), Wellcome Trust Sanger Institute (WTSI)
Classification of subgraphs and identification of a haplotype signature. a Bandage plots for the 28 simple GFA subgraphs made of three contigs. The box on the right is an enlarged view of a heterozygous branch subgraph with a total length of 451 kb. b Dot plots of LAST all-against-all minimum-distance sequence comparisons between the three constituent contigs, delineated by the gray lines, for representatives for each of the subgraph classes. The sum of the contig lengths is indicated. The 451-kb subgraph is that enlarged in (a). c Multiple alignment of contig sequences from this subgraph and corresponding region of the DNA reads that map to them, 300 bp either side of the defining deletion. The color code is as in Fig. 4. Graphical Fragment Assembly (GFA) (file format); describes the assembly graph, Very long stretch of a complex tandem repeat (VeCTR)
BUSCO scores following different methods of genome sequence correction. The percentage of complete USCOs together with the number of single, duplicated, fragmented (Frag) or missing USCO predicted from the uncorrected genome sequence, or the sequence after processing by different methods (see ‘Methods’ for details). The figures for the WTSI reference are shown for comparison. The default analysis mode is the short one. In the long mode, an optimization of the BUSCO sub-program Augustus is used for self-training, substantially increasing run times
Short BUSCO | Long BUSCO | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
Description | Complete (%) | Single (n) | Duplicated (n) | Fragmented (n) | Missing (n) | Complete (%) | Single (n) | Duplicated (n) | Fragmented (n) | Missing (n) |
Uncorrected (Canu only) | 49.2 | 437 | 46 | 128 | 371 | 65.0 | 585 | 53 | 122 | 222 |
Nanopolish | 85.4 | 733 | 106 | 73 | 70 | |||||
Bowtie2 + Pilon | 75.6 | 630 | 112 | 95 | 145 | 85.1 | 709 | 127 | 72 | 74 |
HISAT2 + Pilon | 74.7 | 618 | 116 | 101 | 147 | 81.5 | 679 | 122 | 83 | 98 |
Trinity# [de novo] | 90.8 90.7 | 681 441 | 211 450 | 65 64 | 25 27 | 91.3 | 446 | 451 | 59 | 26 |
Trinity# [genome-guided] | 88.4 88.2 | 316 304 | 552 562 | 63 61 | 51 55 | 88.5 | 305 | 564 | 58 | 55 |
Trinity# [Expression-filtered] | 87.9 87.7 | 644 613 | 219 248 | 58 60 | 61 61 | 88.3 | 618 | 249 | 55 | 60 |
Trinity# [collapsed] | 87.2 87 | 786 791 | 71 64 | 58 63 | 67 64 | 87.6 | 796 | 64 | 57 | 65 |
WTSI reference | 73.1 | 677 | 41 | 133 | 131 | 80.6 | 748 | 43 | 125 | 66 |
Sequence correction statistics. The number and proportion of classes of sequence corrections are given, calculated using the total number of genome bases mapped by the cDNA reads at coverage >10 (77,335,202). INDELs of three or more were binned together as their individual contribution to variance was less than 1%
Count | Proportion (of variants) | Proportion (of bases) | Original | Corrected |
---|---|---|---|---|
312,140 | 44.2% | 0.404% | . | N [1 bp INS] |
65,725 | 9.3% | 0.085% | A | G |
64,702 | 9.2% | 0.084% | T | C |
48,786 | 6.9% | 0.063% | .. | NN [2 bp INS] |
33,471 | 4.7% | 0.043% | C | T |
33,348 | 4.7% | 0.043% | G | A |
33,163 | 4.7% | 0.043% | N | . [1 bp DEL] |
19,452 | 2.8% | 0.024% | …+ | NNN+ [3+ bp INS] |
13,435 | 1.9% | 0.017% | T | A |
13,215 | 1.9% | 0.017% | A | T |
11,925 | 1.7% | 0.015% | T | G |
11,617 | 1.6% | 0.015% | A | C |
9681 | 1.4% | 0.013% | G | C |
9524 | 1.3% | 0.012% | C | A |
9469 | 1.3% | 0.012% | C | G |
9442 | 1.3% | 0.012% | G | T |
4188 | 0.6% | 0.005% | NN | .. [2 bp DEL] |
3444 | 0.5% | 0.004% | NNN+ | …+ [3+ bp DEL] |
Since in the current study, the Trinity [genome-guided]-based approach using RNA-seq data gave a marginally higher complement of USCO genes, we used this gene set for further analyses. In addition to an elevated proportion of fragmented USCOs (see below), we also noted a high proportion of duplicated USCOs. Inspection revealed that some of these were bona fide lineage-specific expansions. For example, the analysis uncovered three distinct loci encoding isoforms of fructose 1,6-bisphosphatase (PFAM: PF00316), as predicted also from the WTSI assembly. Pairs of USCOs were also found on homologous contigs. There were four such examples in the 12 heterozygous branch subgraphs alone; this presumably reflects haplotypic variants.
Repeat sequence confounds Trinity gene prediction. a CAZy analysis indicated that two adjacent Trinity-generated transcripts (c16_g1_i2 and c16_g1_i1, left and right, respectively; exons indicated by blue boxes) had been incorrectly predicted since they correspond to C- and N-terminal fragments of a single CAZyme gene (of the GT33 family). The read coverage of the exons for the two predicted transcripts was similar, supporting a single-gene model. BUSCO erroneously called c16_g1_i1 as complete (EOG091H03EM0). b A self-map dot plot of the genomic context in which these transcripts reside demonstrated a high degree of sequence similarity throughout the region, where the two predicted transcripts were split by a palindromic repeat motif that switched directions at the intersection of the two transcripts. The coordinates (in kbp) on the contig tig000206 are shown
As explained above, during the long-term culture of N. brasiliensis, attempts are made to maintain genotypic diversity in the population. The assembly that we produced reflects this. Surgical transplantation of single adult parasitic worms has very recently been shown to be feasible, allowing controlled matings and the establishment of inbred lines [30]. A less demanding alternative approach to describe the different genotypes within a population would be to assemble separate genome sequences directly from DNA extracted from individual nematodes. Current sequencing technologies are not sufficiently sensitive to allow this. Whole-genome amplification (WGA) was, therefore, used to generate sufficient DNA from a single adult worm. This was then sequenced with a single MinION run. Base-calling using Albacore produced 2,074,871 reads for 6.5 Gb of sequence (compared to 1,722,835 reads totaling 4.9 Gb of sequence using the older MinKNOW approach). Based on the initial FASTQ files, the 90th percentile of length for the WGA sequencing run was 6.4 kb, with more than 27,000 reads of at least 13 kb. The overall read length distribution for the WGA run was somewhat shifted to smaller sizes (Additional file 6: Figure S3), presumably a consequence of Phi29 processivity.
Read mapping statistics
Read category | WGA | Unamplified | ||
---|---|---|---|---|
Assembly | WGA | Final | WGA | Final |
Read count | 1,362,775 | 1,362,775 | 2,137,475 | 2,137,475 |
Mapped | 1,106,671 | 1,233,514 | 1,525,846 | 1,922,031 |
Well mapped1 | 571,180 | 727,433 | 563,302 | 1,386,713 |
Uniquely mapped2 | 16,977 | 143,820 | 16,568 | 412,753 |
Median mapped read proportion3 | 91.1% | 94.1% | 65.6% | 96.7% |
Median mapped proportion excluding well-mapped reads | 56.1% | 65.9% | 31.1% | 67.2% |
Since the WGA reads that mapped only to the final assembly (in regions of low sequence coverage) represented 11% of the total number, sequencing depth was not a major limiting factor. As the stringent pairwise mapping also indicated that the WGA assembly captured a third (32.4%) of the predicted genome, it appears that genome amplification was indeed biased; future attempts should use alternate amplification (e.g. primer-free) approaches. The results do, however, illustrate that assembling a complete genome from a single nematode is within the bounds of current technology and this method could be used to survey genotypic diversity in populations.
Conclusions
For researchers interested only in coding genes and their expression, standard short-read RNA-seq will continue to be an efficient and appropriate technology. As we showed here, however, Illumina-type DNA sequencing is particularly poor as a basis for assembling genomes rich in long repeats. On the other hand, our results illustrate that a de novo assembly of high quality can be obtained using only long reads, even of a complex genome from a heterogeneous population, using a very modest sequencing depth (24× after trimming and correction). At the local level, the incorporation of sequencing polishing, using only MinION reads, raises sequence quality close to that seen with Illumina-based approaches. The results revealed the need for further improvements in resolving ambiguous contig architectures and in transcript-directed gene structure prediction. Nevertheless, since haplotypic variation could be detected even without RNA-seq-directed sequence correction, they clearly show the potential for using this approach to profile parasite populations, opening the way for detailed molecular epidemiological studies.
Methods
Canu 1.5
Changes to Canu relative to v1.4 are documented in the v1.5 release notes (available at https://github.com/marbl/canu/releases/tag/v1.5). Briefly, one major change was a switch in the alignment algorithm used to generate consensus for the corrected reads and final contigs. This improved the corrected reads by slightly increasing their identity and splitting fewer reads. Canu also started using the raw read overlaps to estimate corrected read lengths and used that information to inform its selection of reads for correction. As with all such analysis packages, bugs continue to be identified and corrected. The first release of v1.5, for example, included a bug that erroneously removed heterozygous edges such that in the type of subgraphs shown in Fig. 4a, the results presented would potentially correspond to lower bounds of complexity. This and other issues have been corrected in recent development versions of Canu and the latest public release.
Nematode cultures
The N. brasiliensis strain used in this study was originally sourced from Lindsey Dent (University of Adelaide) and maintained at the Malaghan Institute for 22 years by serial passage through female Lewis rats [31]. Ethics approval is overseen and approved by the Animal Ethics Committee of Victoria University of Wellington. The strain used for the WTSI reference genome is not fully documented. In principle, it derives from a line that had been maintained serially at the National Institute for Medical Research at Mill Hill by Bridget Ogilvie starting in the early 1970s and then by Rick Maizel, at Imperial College. Murray Selkirk continued to maintain the strain at Imperial after R. Maizel moved to the University of Edinburgh, where he established parallel lines. These Edinburgh lines were supplemented on a number of occasions with cultures from Imperial College. The standard cycle involved injection of 4000–6000 L3 larvae into Sprague–Dawley rats, but was otherwise similar to that used at the Malaghan (R. Maizel, personal communication).
DNA preparation
- 1.
Sonication, DNeasy: Worms were disrupted by sonication and then processed using a Qiagen DNeasy kit (ATL/AL, 30 min at 56 °C). This yielded 900 ng of DNA and 1800 ng of RNA.
- 2.
Sonication, G2, Tip20: Worms were disrupted by sonication, then lysed using Qiagen buffer G2 (30 min at 56 °C). The lysate was purified using a Qiatip 20 anion exchange column. This yielded 800 ng of DNA.
- 3.
Direct lysis, DNeasy: Intact worms were added without prior disruption directly into Qiagen buffer G2 (30 mins at 56 °C). The lysate was processed using a Qiagen DNeasy kit (ATL/AL, 30 min at 56 °C). This yielded 700 ng of DNA and 17,700 ng of RNA.
- 4.
Pipette squashed, G2, Tip20: Worms were mashed against the side of the sample tube using a pipette tip. The cells then underwent chemical lysis in the Qiagen buffer G2 (120 min at 56 °C) and lysate was purified using the Qiatip 20 anion exchange column. The DNA was fragmented using a Covaris G-tube prior to library preparation. This yielded 1100 ng of DNA.
All samples were subsequently processed using the ONT 1D Ligation Sequencing Kit (SQK-LSK108), ligating the ONT adapter mix onto end-prepped and dA-tailed DNA. Each prepared library was loaded onto a different MinION flow cell for sequencing.
Genome assembly
-
chimaera version = 1.23.4
-
dragonet version = 1.23.0
-
event_detection = Analyses/EventDetection_000
-
name = ONT Sequencing Workflow
-
segmentation = Analyses/Segment_Linear_000
-
time_stamp = 2017-Jan-21 07:55:13
The raw reads were recalled using Albacore 1.1.0, which implements a time-domain correction (transducer) for homopolymeric regions in the called sequence. This is also available in the open-source Scrappie (https://github.com/nanoporetech/scrappie) base caller:
for x in $(ls -d [CFED]_??????); do echo ${x}; read_fast5_basecaller.py -t 6 -i ${x} -s called_${x} -o fastq -c r94_450bps_linear.cfg; done
Illumina reads for the WTSI genome assembly were retrieved from the Sequence Read Archive (accession ERR063640). These reads were processed into k-mer counts using Jellyfish v2.2.6 [32], and the resulting histogram used in GenomeScope (see below):
jellyfish count -C --bf-size 20G -t 6 -s 2G -m 21 <(zcat ERR063640_1P.fastq.gz) <(zcat ERR063640_2P.fastq.gz) -o ERR063640_mer_counts.jf
jellyfish histo ERR063640_mer_counts.jf > ERR063640_mer_counts.histo
Using the FASTQ files, rather than just reads from the called pass bin, allowed a maximum amount of sequence information to be recovered. Reads were trimmed by 65 bp at each end to exclude adapters (see below), then Canu v1.5 was run with default parameters. The estimated genome size was 205 Mb as determined using GenomeScope (qb.cshl.edu/genomescope/) and the WTSI Illumina reads. Long-read sequencing of a heterogeneous population would be expected to generate a longer genome size due to the separation of haplotypes. We, therefore, made a conservative estimate of 300 Mb. The assumed genome size alters how many reads are selected for the final Canu assembly. If the coverage of corrected reads is greater than 40×, then only the longest reads are used (up to 40× coverage). In our case, the coverage was below 40× regardless of what parameters were used, so it would not be expected to alter the outcome. Within Canu, assembly parameters are adjusted when read counts are below 20× (e.g. corMinCoverage = 0) as previously described [33]:
## trim reads
pv called_[CFED]_*_albacore_1.1.0.fq.gz | zcat | ~/scripts/fastx-fetch.pl -t 65 | gzip > called_CFED_65bptrim_albacore_1.1.0.fq.gz
## run Canu
~/install/canu/canu-1.5/Linux-amd64/bin/canu -nanopore-raw called_CFED_65bptrim_albacore_1.1.0.fq.gz -p Nb_ONTCFED_65bpTrim_t1 -d Nb_ONTCFED_65bpTrim_t1 genomeSize=300 M
Bowtie2 was used in local mode to map RNA-seq reads to the assembled genome contigs:
bowtie2 -p 10 --local -x Nb_ONTCFED_65bpTrim_t1.contigs.fasta -1 ../1563-all_R1_trimmed.fastq.gz -2 ../1563-all_R2_trimmed.fastq.gz | samtools sort > 1563_vs_uncorrected_NOCFED.bam
Pilon was used to correct based on the RNA-seq mapping to the genome, with structural reassembly disabled (in case it collapsed introns):
java -Xmx40G -jar ~/install/pilon/pilon-1.22.jar --genome Nb_ONTCFED_65bpTrim_t1.contigs.fasta --frags 1563_vs_uncorrected_NOCFED.bam --fix snps,indels --output BT2Pilon_NOCFED --gapmargin 1 --mingap 10000000 --threads 10 --changes 2>BT2Pilon_NOCFED.stderr.txt 1>BT2Pilon_NOCFED.stdout.txt
Contigs that were entirely composed of homopolymer sequences were identified using grep and removed from the assembly:
## identify homopolymer (and binary division-rich) regions
pv BT2Pilon_NOCFED.fasta | ~/scripts/fastx-hplength.pl > hplength_BT2Pilon_NOCFED.txt
pv BT2Pilon_NOCFED.fasta | ~/scripts/fastx-hplength.pl -mode YR > hplength_YR_BT2Pilon_NOCFED.txt
pv BT2Pilon_NOCFED.fasta | ~/scripts/fastx-hplength.pl -mode SW > hplength_SW_BT2Pilon_NOCFED.txt
pv BT2Pilon_NOCFED.fasta | ~/scripts/fastx-hplength.pl -mode MK > hplength_MK_BT2Pilon_NOCFED.txt
## example grep hunt for repeated sequence
cat BT2Pilon_NOCFED.fasta | grep -e '^[AT]\{80\}' -e '^>' | grep --no-group-separator -B 1 '^[AT]\{80\}' | ~/scripts/fastx-length.pl
## exclude contigs and sort by length
~/scripts/fastx-fetch.pl -v tig00010453 tig00024413 tig00024414 tig00023947 | ~/scripts/fastx-sort.pl -l > Nb_ONTCFED_65bpTrim_t1.contigs.hpcleaned.fasta
This produced the final assembly described in the paper. At this stage, we had an assembled genome, but validation of the genome was difficult. We decided to carry out a draft genome-guided transcriptome assembly with Trinity. To evaluate the completeness of the genome, we focused on expressed genes and used a set of Illumina RNA-seq reads to perform a genome-guided transcriptome assembly using Trinity.
The RNA-seq reads were remapped to the corrected assembly for genome-guided Trinity:
bowtie2 -p 10 -t --local --score-min G,20,8 -p 10 -x BT2Pilon_NOCFED_hpcleaned.fasta --rf -X 15000 -1
../1563-all_R1_trimmed.fastq.gz -2 <(pv../1563-all_R2_trimmed.fastq.gz | zcat) 2>bowtie2_1563_vs_BNOCFED_hp.summary.txt | samtools sort > bowtie2_1563_vs_BNOCFED_hp.bam
## Trinity assembly; assume introns can be up to 15 kb in length
~/install/trinity/trinityrnaseq-Trinity-v2.4.0/Trinity --CPU 10 --genome_guided_bam bowtie2_1563_vs_BNOCFED_hp.bam --genome_guided_max_intron 15000 --max_memory 40G --SS_lib_type RF --output trinity_BNOCFED
The assembly that Trinity generated had similar completeness (as measured by BUSCO) to a de novo assembly generated using the same RNA-seq reads (see Table 3). The assembly was, however, very large, with over 350 k contigs (see Table 1), most likely due to isoform fragments being included in the transcriptome. We carried out additional filtering steps, reducing the number of assembled contigs while maintaining similar BUSCO completeness scores.
Transcript filtering
Our RNA-seq data is publicly available from the European Nucleotide Archive of the European Bioinformatics Institute (https://www.ebi.ac.uk/ena/data/view/ERS1809079), where details of the samples can be found.
The estimated numbers of mapped RNA-seq reads (as predicted by Salmon [34]) were used to filter transcripts, because the genome-guided assembly was based on RNA-seq expression. Salmon uses a pseudo-alignment approach and was chosen to correct for reads mapping to the same sequence, but on different isoforms and/or genes rather than alternative alignment-based approaches that are designed around the expectation of a relatively complete and non-repetitive genome. Since this analysis was to estimate the expression distribution, however, the specific mapping strategy is not expected to influence greatly the outcome. RNA-seq reads were mapped to the Bowtie2/Pilon genome-guided assembly, and a threshold of 50 reads for true expression of complete BUSCO genes was chosen for filtering transcripts from the genome-guided Trinity transcriptome. The longest open reading frame from each transcript was extracted to generate a protein sequence for further collapsing using cdhit to remove protein sequences that had at least 98% identity to a longer protein, leaving 56,980 transcripts. Each of these steps are outlined below.
The RNA-seq reads were mapped to the Trinity-generated transcripts using Salmon:
## create Salmon index
~/install/salmon/Salmon-0.8.2_linux_x86_64/bin/salmon index -t Trinity-BNOCFED.fasta -i Trinity-BNOCFED.fasta.sai
## quantify transcript coverage with Salmon
~/install/salmon/Salmon-0.8.2_linux_x86_64/bin/salmon quant -i Trinity-BNOCFED.fasta.sai -1 ../../1563-all_R1_trimmed.fastq.gz -2 ../../1563-all_R2_trimmed.fastq.gz -p 10 -o quant/1563-all_quant -l A
The expression of BUSCO genes was used to set a credible signal cutoff (Additional file 7: Figure S4). A threshold of 50 mapped reads was chosen, as 98% of complete BUSCO sequences had more than 50 mapped reads. The transcripts were subsetted based on their expression scores:
## Subset transcripts based on a threshold of 50 counts
pv Trinity-GG.fasta | ~/scripts/fastx-fetch.pl -i HighCount_Transcripts_Num50.txt > HighCount50_TBNOCFED.fasta
The longest Met to Stop open reading frame was identified for each transcript for protein-based clustering with cdhit:
pv HighCount50_TBNOCFED.fasta | getorf -find 1 -noreverse -sequence/dev/stdin -outseq/dev/stdout | ~/scripts/fastx-isofilter.pl -o > longest_MetStopORF_HC50_TBNOCFED.fasta
## run cdhit
cdhit -T 10 -c 0.98 -i longest_MetStopORF_HC50_TBNOCFED.fasta -o cdhit_0.98_LMOHC50_TBNOCFED.prot.fasta
## identify names of longest representative proteins for each cluster
grep '^>' cdhit_0.98_LMOHC50_TBNOCFED.prot.fasta | perl -pe 's/^>//' > cdhit_0.98_LMOHC50_TBNOCFED.prot.names.txt
## fetch transcripts associated with the representative proteins
pv HighCount50_TBNOCFED.fasta | ~/scripts/fastx-fetch.pl -i cdhit_0.98_LMOHC50_TBNOCFED.prot.names.txt > cdhit_0.98_LMOHC50_TBNOCFED.tran.fasta
The longest isoform for each gene was identified, producing an isoform-collapsed transcriptome subset, which was clustered at the protein level by cdhit at 90% identity:
## extract longest protein for each transcript
cat cdhit_0.98_LMOHC50_TBNOCFED.prot.fasta | ~/scripts/fastx-isofilter.pl > LI_CD98LMOHC50_TBNOCFED.prot.fasta
## cluster at 90% via CDHIT
cdhit -i LI_CD98LMOHC50_TBNOCFED.prot.fasta -o CDLI_CD98LMOHC50_TBNOCFED.prot.fasta
## find associated transcripts
grep '^>' CDLI_CD98LMOHC50_TBNOCFED.prot.fasta | perl -pe 's/^>//' > CDLI_CD98LMOHC50_TBNOCFED.names.txt
cat cdhit_0.98_LMOHC50_TBNOCFED.tran.fasta | ~/scripts/fastx-fetch.pl -i CDLI_CD98LMOHC50_TBNOCFED.names.txt > CDLI_CD98LMOHC50_TBNOCFED.tran.fasta
BUSCO was run on the collapsed transcripts to provide one measure of genome completeness:
## run BUSCO on isoform-collapsed transcripts in long genome mode
python ~/install/busco/BUSCO.py -i../CDLI_CD98LMOHC50_TBNOCFED.tran.fasta -o BUSCO_longgeno_CDLI_CD98LMOHC50_TBNOCFED_nematodes -l \
~/install/busco/nematoda_odb9 -m geno -c 10 --long
## run BUSCO on isoform-collapsed transcripts in transcript mode
python ~/install/busco/BUSCO.py -i../CDLI_CD98LMOHC50_TBNOCFED.tran.fasta -o BUSCO_tran_CDLI_CD98LMOHC50_TBNOCFED_nematodes -l \
~/install/busco/nematoda_odb9 -m tran -c 10
This filtered set had very similar BUSCO scores to the original genome-guided Trinity assembly, despite the number of transcripts being reduced down to 1/6 of their original count, and the length of the transcriptome being reduced down to less than 1/3 of its original size. The number of duplicated BUSCO genes in the filtered set suggests that this set could probably be made smaller with a looser cd-hit-est clustering, although there is a chance that such a reduction may cause gene copies to be clustered together.
The missing USCOs were remapped to our assembled genome to determine whether any of the corresponding genes were actually present in the genome:
blastx -num_threads 10 -query <(pv/mnt/gg_nanopore/gringer/ONT_Jan17/Nb_ONTCFED_65bpTrim_t1/Nb_ONTCFED_65bpTrim_t1.contigs.fasta)
-db missing_busco_list_intersectionBUSCO_nematodes.fasta -outfmt 6 -evalue 1e-3 > BLAST_NOCFED_vs_BUSCO_missing_intersection.tsv
Hits for the same contig/USCO combination were merged using a custom script, and sequence extracted from the assembly to cover the entire matched region:
(~/scripts/blastx_boundaries.pl BLAST_NOCFED_vs_BUSCO_missing_intersection.tsv | while read a b; do samtools faidx/mnt/gg_nanopore/gringer/ONT_Jan17/Nb_ONTCFED_65bpTrim_t1/Nb_ONTCFED_65bpTrim_t1.contigs.fasta ${b} | perl -pe "s/^>/>${a}-/"; done) > BLAST_NOCFED_vs_BUSCO_missing_intersection.fasta
Whole-genome amplification and assembly
DNA extracted from a single adult worm was amplified using a Qiagen Midi RepliG kit. Raw reads and called FASTQ files were obtained from ONT (called by workflow “1D RNN Basecalling 450 bps FastQ v1.121”) following sequencing. Reads were filtered to exclude those from contaminating DNA using OneCodex (http://onecodex.com). Reads that mapped to any DNA in the OneCodex database were excluded from the read set:
(zcat OneCodex_RefSeq_132394.fastq.gz.results.tsv.gz | awk '{if($3 == 0){print $1}}'; zcat OneCodex_OCD_132394.fastq.gz.results.tsv.gz | awk '{if($3 == 0){print $1}}') | sort | uniq -d | gzip > OCunmapped_names_132394.txt.gz
pv 132394.fastq.gz | zcat | ~/scripts/fastx-fetch.pl -i OCunmapped_names_132394.txt.gz | ~/scripts/fastx-fetch.pl -v -i ONTmapped_names_132394.txt.gz | gzip > OCunmapped_ONTunmapped_132394.fastq.gz
Filtered reads from the WGA sample were called using Albacore 1.1.0:
read_fast5_basecaller.py -o fastq -i A_132394 -t 10 -s called_A_132394 -c r94_450bps_linear.cfg
Reads with a length of greater than 10 k were extracted for subsequent analysis:
pv called_A_132394_albacore_1.1.0.fq.gz | zcat | ~/scripts/fastx-fetch.pl --min 10000 | gzip > 10k_called_A_132394_albacore_1.1.0.fq.gz
To define the region of raw nanopore sequences for adapter exclusion, the >10 k reads were mapped to 50 M reads that had been generated by WTSI and that had been used for the existing WTSI assembly:
bowtie2 -p 10 --no-unal --no-mixed --local -x 10k_called_A_132394_albacore_1.1.0.fa -1 <(pv ~/bioinf/MIMR-2017-Jan-01-GBIS/GLG/ONT/aws/Sampled_50M_ERR063640.R1.fq.gz | zcat) -2 ~/bioinf/MIMR-2017-Jan-01-GBIS/GLG/ONT/aws/Sampled_50M_ERR063640.R2.fq.gz | samtools sort > WTSI_Sampled_50M_vs_10k_called_A_132394.bam
## find position of first mapped Illumina read for each nanopore read
pv WTSI_Sampled_50M_vs_10k_called_A_132394.bam | samtools view - | awk '{print $3,$4}' | sort -k 1,1 -k 2,2n | sort -u -k 1,1 | gzip > firstHit_WTSI_Sampled_50M_vs_10k_called_A_132394.txt.gz
## determine position of last mapped Illumina read
pv WTSI_Sampled_50M_vs_10k_called_A_132394.bam | samtools view - | awk '{print $3,$4}' | sort -k 1,1 -k 2,2rn | sort -u -k 1,1 | gzip > lastHit_WTSI_Sampled_50M_vs_10k_called_A_132394.txt.gz
## count positions of first reads
zcat firstHit_WTSI_Sampled_50M_vs_10k_called_A_132394.txt.gz | awk '{print $2}' | sort -n | uniq -c | gzip > firstBase_counts_WTSI_Sampled_50M_vs_10k_called_A_132394.txt.gz
When we mapped the 5' ends of Illumina reads to the start of nanopore reads with Bowtie2, there was a common register shift of 28–32 bases, corresponding to the presence of adapter sequences in the nanopore reads. Reads were conservatively trimmed by 65 bases at each end to exclude adapters:
pv called_A_132394_albacore_1.1.0.fq.gz | zcat | ~/scripts/fastx-fetch.pl --min 1130 --max 1000000 | \
~/scripts/fastx-fetch.pl -t 65 | gzip > 65bpTrim_called_A_132394_albacore_1.1.0.fq.gz
Canu v1.5 was used to assemble the trimmed reads. The assembly was done in stages (with an assembly at each stage) to determine whether or not particular stages were redundant for the assembly:
## attempt assembly-only with Canu v1.5
~/install/canu/canu-1.5/Linux-amd64/bin/canu -assemble -nanopore-raw 65bpTrim_called_A_132394_albacore_1.1.0.fq.gz -p Nb_ONTA_65bpTrim_t1 -d
Nb_ONTA_65bpTrim_t1 genomeSize=300 M
## attempt assembly + correction
~/install/canu/canu-1.5/Linux-amd64/bin/canu -assemble -nanopore-corrected 65bpTrim_called_A_132394_albacore_1.1.0.fq.gz -p Nb_ONTA_65bpTrim_t2 -d Nb_ONTA_65bpTrim_t2 -correct genomeSize=300 M
## attempt stringent trim with corrected reads
~/install/canu/canu-1.5/Linux-amd64/bin/canu -trim-assemble -p Nb_ONTA_65bpTrim_t3 -d Nb_ONTA_65bpTrim_t3 genomeSize=300 M -nanopore-corrected Nb_ONTA_65bpTrim_t2/Nb_ONTA_65bpTrim_t2.correctedReads.fasta.gz -trim-assemble trimReadsOverlap=500 trimReadsCoverage=5 obtErrorRate=0.25
An alternative, less-stringent overlap was also attempted (with trimReadsCoverage = 2), but resulted in a less complete assembly. The results of an analysis surrounding the different assemblies suggested that the default Canu assembly process of correction, trimming, then assembly produced the best outcome, namely the WGA assembly presented in Table 1.
Canu includes a step of normalization, reducing the shortest reads, if coverage is over a predefined threshold, but that normalization threshold was not triggered for our assembly. We did not attempt a k-mer-based normalization, since this would not resolve the incomplete genome sequence coverage. Selective amplification combined with random sampling by the sequencer means that poorly amplified regions were unlikely to be present in the sequenced reads. We did notice that an assembly combining both WGA and unamplified DNA samples produced a more fragmented genome, which is why a combined assembly is not presented here. It is possible that k-mer normalization of the WGA reads might improve such a hybrid assembly.
Declarations
Acknowledgements
We thank Rick Maizel for a discussion, and Simon Mayes and colleagues at ONT for their assistance. This study utilized the computational resources of the Biowulf system at the National Institutes of Health, Bethesda, MD (https://hpc.nih.gov/systems/). The Cloud Infrastructure for Microbial Bioinformatics (CLIMB) service in the U.K. and the Genomics Research Centre server at Queensland University of Technology (QUT) were used to strip out event data and facilitate data transfer.
Funding
This study was supported by program grant funding from the Health Research Council of New Zealand (14/003) and the Marjorie Barclay Trust. SK was supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health. The funding bodies had no role in the design of the study, nor in the collection, analysis, and interpretation of data, nor in writing the manuscript.
Availability of data and materials
MinION reads (base-called FASTQ and raw) used for this assembly project (as well as RNA-seq reads used for assembly correction) are available from the European Nucleotide Archive, project PRJNA328296. Accessory scripts used for sequence and assembly analysis (see ‘Methods’) can be found at https://github.com/gringer/bioinfscripts.
Authors' contributions
DE, GLG, and JJE conceived and designed the experiments. DE, JC, and MC performed the experiments. DE, BH, SK, and JJE analyzed the data. DE and SK contributed reagents, materials, analysis, and tools. DE and JJE wrote the paper. All authors read and approved the final manuscript.
Ethics approval
Ethics approval was provided by the Animal Ethics Committee of Victoria University of Wellington.
Consent for publication
Not applicable
Competing interests
The authors declare no competing interests but note that ONT, the manufacturers of the MinION device, carried out DNA sample preparation and sequencing free of charge, and have covered travel and accommodation expenses for SK to speak at ONT conferences.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
- Wit J, Gilleard JS. Resequencing helminth genomes for population and genetic studies. Trends Parasitol. 2017;33(5):388–99.View ArticlePubMedGoogle Scholar
- Johnson SS, Zaikova E, Goerlitz DS, Bai Y, Tighe SW. Real-time DNA sequencing in the Antarctic dry valleys using the Oxford Nanopore sequencer. J Biomol Tech. 2017;28(1):2–7.PubMedPubMed CentralGoogle Scholar
- Edwards A, Debbonaire AR, Sattler B, Mur LA, Hodson AJ. Extreme metagenomics using nanopore DNA sequencing: a field report from Svalbard, 78 N. bioRxiv. 2016. https://doi.org/10.1101/073965.
- Koren S, Schatz MC, Walenz BP, Martin J, Howard JT, Ganapathy G, et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol. 2012;30(7):693–700.View ArticlePubMedPubMed CentralGoogle Scholar
- Zimin AV, Puiu D, Luo MC, Zhu T, Koren S, Marcais G, et al. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Res. 2017;27(5):787–92.View ArticlePubMedPubMed CentralGoogle Scholar
- Ye C, Hill CM, Wu S, Ruan J, Ma ZS. DBG2OLC: efficient assembly of large genomes using long erroneous reads of the third generation sequencing technologies. Sci Rep. 2016;6:31900.View ArticlePubMedPubMed CentralGoogle Scholar
- Jansen HJ, Liem M, Jong-Raadsen SA, Dufour S, Weltzien FA, Swinkels W, et al. Rapid de novo assembly of the European eel genome from nanopore sequencing reads. Sci Rep. 2017;7(1):7213.View ArticlePubMedPubMed CentralGoogle Scholar
- Berlin K, Koren S, Chin CS, Drake JP, Landolin JM, Phillippy AM. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol. 2015;33(6):623–30.View ArticlePubMedGoogle Scholar
- Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics. 2016;32(14):2103–10.View ArticlePubMedPubMed CentralGoogle Scholar
- Chin CS, Peluso P, Sedlazeck FJ, Nattestad M, Concepcion GT, Clum A, et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods. 2016;13(12):1050–4.View ArticlePubMedPubMed CentralGoogle Scholar
- Chin CS, Alexander DH, Marks P, Klammer AA, Drake J, Heiner C, et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods. 2013;10(6):563–9.View ArticlePubMedGoogle Scholar
- Koren S, Harhay GP, Smith TP, Bono JL, Harhay DM, McVey SD, et al. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol. 2013;14(9):R101.View ArticlePubMedPubMed CentralGoogle Scholar
- Koren S, Phillippy AM. One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Curr Opin Microbiol. 2015;23:110–20.View ArticlePubMedGoogle Scholar
- Gordon D, Huddleston J, Chaisson MJ, Hill CM, Kronenberg ZN, Munson KM, et al. Long-read sequence assembly of the gorilla genome. Science. 2016;352:aae0344.View ArticlePubMedPubMed CentralGoogle Scholar
- Bickhart DM, Rosen BD, Koren S, Sayre BL, Hastie AR, Chan S, et al. Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat Genet. 2017;49(4):643–50.View ArticlePubMedGoogle Scholar
- Jarvis DE, Ho YS, Lightfoot DJ, Schmockel SM, Li B, Borm TJ, et al. The genome of Chenopodium quinoa. Nature. 2017;542(7641):307–12.View ArticlePubMedGoogle Scholar
- Sotillo J, Sanchez-Flores A, Cantacessi C, Harcus Y, Pickering D, Bouchery T, et al. Secreted proteomes of different developmental stages of the gastrointestinal nematode Nippostrongylus brasiliensis. Mol Cell Proteomics. 2014;13(10):2736–51.View ArticlePubMedPubMed CentralGoogle Scholar
- Kassai T. Handbook of Nippostrongylus brasiliensis (nematode). Slough, Berkshire: Commonwealth Agricultural Bureaux; 1982.Google Scholar
- Holroyd N, Sanchez-Flores A. Producing parasitic helminth reference and draft genomes at the Wellcome Trust Sanger Institute. Parasite Immunol. 2012;34(2-3):100–7.View ArticlePubMedGoogle Scholar
- Vurture GW, Sedlazeck FJ, Nattestad M, Underwood CJ, Fang H, Gurtowski J, et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics. 2017;33(14):2202–4.View ArticlePubMedGoogle Scholar
- Lu H, Giordano F, Ning Z. Oxford Nanopore MinION sequencing and genome assembly. Genom Proteom Bioinform. 2016;14(5):265–79.View ArticleGoogle Scholar
- Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017;27(5):722–36.View ArticlePubMedPubMed CentralGoogle Scholar
- Subirana JA, Messeguer X. A satellite explosion in the genome of holocentric nematodes. PLoS One. 2013;8(4):e62221.View ArticlePubMedPubMed CentralGoogle Scholar
- Tyson JR, O'Neil NJ, Jain M, Olsen HE, Hieter P, Snutch TP. MinION-based long-read sequencing and assembly extends the Caenorhabditis elegans reference genome. Genome Res. 2017. https://doi.org/10.1101/099143. https://www.ncbi.nlm.nih.gov/pubmed/29273626. [Epub ahead of print].
- C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science. 1998;282:2012–18.View ArticleGoogle Scholar
- Simao FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31(19):3210–12.View ArticlePubMedGoogle Scholar
- Loman NJ, Quick J, Simpson JT. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat Methods. 2015;12(8):733–5.View ArticlePubMedGoogle Scholar
- Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 2014;9(11):e112963.View ArticlePubMedPubMed CentralGoogle Scholar
- Lombard V, Golaconda Ramulu H, Drula E, Coutinho PM, Henrissat B. The carbohydrate-active enzymes database (CAZy) in 2013. Nucleic Acids Res. 2014;42(Database issue):D490–5.View ArticlePubMedGoogle Scholar
- Sargison ND, Redman E, Morrison AA, Bartley DJ, Jackson F, Naghra-van Gijzel H, et al. A method for single pair mating in an obligate parasitic nematode. Int J Parasitol. 2017. https://doi.org/10.1016/j.ijpara.2017.08.010.
- Chandler J, Camberis M, Bouchery T, Blaxter M, Le Gros G, Eccles DA. Annotated mitochondrial genome with Nanopore R9 signal for Nippostrongylus brasiliensis. F1000Res. 2017;6:56.View ArticlePubMedPubMed CentralGoogle Scholar
- Marcais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27(6):764–70.View ArticlePubMedPubMed CentralGoogle Scholar
- Jain M, Koren S, Quick J, Rand AC, Sasani TA, Tyson JR, et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. bioRxiv. 2017. https://doi.org/10.1101/128835.
- Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017;14(4):417–19.View ArticlePubMedPubMed CentralGoogle Scholar
- Camberis M, Le Gros G, Urban Jr J. Animal model of Nippostrongylus brasiliensis and Heligmosomoides polygyrus. Curr Protoc Immunol. 2003;19(19):12.PubMedGoogle Scholar
- Wintersinger JA, Wasmuth JD. Kablammo: an interactive, web-based BLAST results visualizer. Bioinformatics. 2015;31(8):1305–6.View ArticlePubMedGoogle Scholar