A hybrid de novo viral assembly system
Background
Within an infected host, a virus persists as an intricate network of closely related mutants, known as quasispecies. These are not just a few mutants but immense numbers of highly similar yet distinct variant viral genomes. Collectively, they form a complex tapestry of adaptable mutant forms from which pathogenesis-driving phenotypes may arise.
These quasispecies are not just a fascinating phenomenon; they hold profound implications for therapeutic interventions, clinical outcomes, and the design of future drugs. Our ability to analyze these networks, decode their compositions, and anticipate their future trajectories therefore becomes essential.
This is where GATACA’s GAT/ML™ takes center stage. We go beyond just examining these quasispecies distributions; we chart their futures.
Challenges
High haplotype similarity and the fleeting nature of quasispecies (QS) challenge current algorithms.
Current tools are computationally unwieldy and produce spurious results.
Improved analytics and performance require specialized algorithm training.
GAT/ML is superior on the metrics most crucial for bridging the common trade-off of precision vs. recall in haplotype reconstruction.
Mode | Largest contig (bp) | N50 (bp) | # contigs | # contigs >1000 bp |
---|---|---|---|---|
Sample # SRR12712485 | | | | |
Popular viral assembler #1 | 1855 | 1179 | 575 | 2 |
GAT/ML | 3285 | 3285 | 4 | 3 |
Popular viral assembler #2 | 1476 | 1461 | 41 | 2 |
GAT/ML | 2934 | 1391 | 7 | 4 |
Sample # SRR12712486 | | | | |
Popular viral assembler #1 | 2314 | 2314 | 140 | 2 |
GAT/ML | 3106 | 1806 | 4 | 4 |
Popular viral assembler #2 | 2101 | 2101 | 2 | 2 |
GAT/ML | 3316 | 3316 | 1 | 1 |
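For readers unfamiliar with the contiguity metrics in the table above, the following is a minimal sketch of how they are conventionally computed from a list of contig lengths. N50 is the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembled bases. The function name `assembly_stats` and the example lengths are hypothetical and are not taken from the GAT/ML pipeline or from the samples above.

```python
def assembly_stats(contig_lengths, min_len=1000):
    """Summarize an assembly from its contig lengths (in bp)."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)

    # N50: walk from the longest contig down until half of all
    # assembled bases are covered; report that contig's length.
    n50 = 0
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    return {
        "largest_contig": lengths[0] if lengths else 0,
        "n50": n50,
        "n_contigs": len(lengths),
        # Strictly greater than min_len, matching the ">1000 bp" column.
        "n_contigs_over_min": sum(1 for x in lengths if x > min_len),
    }


# Hypothetical contig lengths, for illustration only.
stats = assembly_stats([3000, 1500, 800, 200])
print(stats)
```

A single long contig dominates N50, which is why a fragmented assembly with hundreds of short contigs can still report a moderate N50 while containing very few contigs above 1000 bp.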
Assembly Improvements, Across Experiments
GAT/ML demonstrated the unanticipated ability to assist competitor assemblers by ‘rescuing’ their failed assembly results, producing long, accurate contigs from highly fragmented outputs. As shown above, the number of full-length HBV contigs (>3 kb) increases when GAT’s ML network is applied, across all failed assemblies and datasets. Most of the competitor assemblies produced enormous populations of short, false-positive contig fragments, which GAT/ML lengthened and reduced in number.
Assembly Efficiency, Across Experiments
GAT/ML has greater complexity than other viral and non-viral assembly tools, yet this complexity does not translate into slower run times (longer wait times) or greater memory usage. As shown, there is no appreciable speed reduction when moving from assembly alone (blue dots), to assembly with the ML modules (green dots), to assembly with the additional GAT/ML scaffolding/second assembly round (purple dots).
GAT/ML significantly improved de novo assembly accuracy, most notably on complex, real-world clinical datasets.
Dataset | Target | Number of samples | Exact matches | Exact matches (%) | Precision | *Recall | F1 |
---|---|---|---|---|---|---|---|
Private patient dataset #1, LOBP | Genotype | 52 | 51 | 98% | 100% | 98% | 99% |
Private patient dataset #2, LOBP | Subgenotype | 22 | 22 | 100% | 100% | 100% | 100% |
SRA dataset, NIH | Genotype | 22 | 18 | 82% | 86% | 82% | 83% |
Private patient dataset, Universitari Vall d’Hebron | Genotype | 44 | 33 | 75% | 91% | 75% | 81% |
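As a rough illustration of how classification metrics like those in the table above can be derived, the sketch below computes per-class precision, recall, and F1 from actual vs. predicted genotype labels and averages them weighted by class size. The source does not state which averaging scheme was used, so support weighting is an assumption here; it is chosen because weighted recall always equals the exact-match rate, which agrees with the Recall column in every row. The function name `weighted_prf` and the labels are hypothetical.

```python
from collections import Counter


def weighted_prf(actual, predicted):
    """Support-weighted precision, recall and F1 for multi-class labels."""
    support = Counter(actual)        # samples per true class
    pred_count = Counter(predicted)  # samples per predicted class
    true_pos = Counter(a for a, p in zip(actual, predicted) if a == p)
    n = len(actual)

    precision = recall = f1 = 0.0
    for label in sorted(support):
        p = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        r = true_pos[label] / support[label]
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support[label] / n  # assumption: weight by class size
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {
        "exact_matches": sum(true_pos.values()),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels, for illustration only.
actual = ["A", "A", "A", "D", "D"]
predicted = ["A", "A", "D", "D", "D"]
print(weighted_prf(actual, predicted))
```

Other averaging schemes (macro or micro) would give different precision and F1 values for the same predictions, so the scheme should be reported alongside the numbers.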
ML-1 Genotypes
The ML-1 module is a read binner that pre-processes the NGS reads prior to assembly. Similarly clustered reads are segregated into variation categories, here by HBV genotype. All data are verified using strong bootstrap support and monophyletic clustering.
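To make the idea of pre-assembly read binning concrete, here is a deliberately simple sketch that assigns each read to the reference genotype with which it shares the most k-mers. This is a generic nearest-reference heuristic swapped in for illustration; it is not the ML-1 model, whose internals are not described here. The function names, the value of `k`, and the toy sequences are all assumptions.

```python
from collections import Counter


def kmer_profile(seq, k=4):
    """Count every overlapping k-mer in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a, b):
    """Number of k-mers two profiles have in common (multiset intersection)."""
    return sum((a & b).values())


def bin_reads(reads, references, k=4):
    """Assign each read to the reference label sharing the most k-mers."""
    ref_profiles = {name: kmer_profile(seq, k) for name, seq in references.items()}
    bins = {name: [] for name in references}
    for read in reads:
        profile = kmer_profile(read, k)
        best = max(ref_profiles, key=lambda name: shared_kmers(profile, ref_profiles[name]))
        bins[best].append(read)
    return bins


# Toy sequences standing in for genotype references; not real HBV data.
references = {"A": "ACGTACGTACGTAAAA", "D": "TTGGCCTTGGCCTTGG"}
reads = ["ACGTACGT", "TTGGCCTT"]
print(bin_reads(reads, references))
```

With real data, highly similar genotypes share most of their k-mers, which is one reason a trained classifier may be preferred over a simple count of shared k-mers.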
GAT’s ML-2 module follows assembly, bins the contigs into similar clusters, and then refines or “fingerprints” the contigs with defining features predictive of their classification.
Confusion matrix of HBV genotype predictions showing actual vs. predicted bin classes (left), and scatter plot (right) showing the position of each contig in two of the 13 latent dimensions identified by GAT/ML’s contig fingerprinting module. In just two dimensions, the contigs of each genotype occupy distinct, well-defined regions.
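A confusion matrix of the kind described in the caption above can be tabulated from paired actual and predicted bin labels. The following is a minimal sketch; the function name `confusion_matrix` and the labels are hypothetical, and rows are taken to be actual classes and columns predicted classes.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels=None):
    """Return (labels, matrix) where matrix[i][j] counts samples whose
    actual class is labels[i] and predicted class is labels[j]."""
    if labels is None:
        labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


# Hypothetical genotype calls, for illustration only.
actual = ["A", "A", "D", "D", "D"]
predicted = ["A", "D", "D", "D", "D"]
labels, matrix = confusion_matrix(actual, predicted)
for label, row in zip(labels, matrix):
    print(label, row)
```

Correct calls fall on the diagonal, so off-diagonal counts show which genotypes are mistaken for which.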
Conclusions
- Assembly alone has reached its limits in overcoming technical obstacles. Specialized training is required.
- Our novel design using ML and NLP accurately and efficiently reconstructed full-length HBV variant genomes, including highly similar sequences derived from cell cultures.
- ML focused the assembly of NGS samples; superior performance was observed in metrics most crucial for bridging the common trade-off of precision vs. recall.
- A 'Goldilocks' region for sequencing depth was found and a proprietary subsampling approach was used to reduce coverage problems.
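The subsampling approach mentioned in the conclusions is proprietary and is not reproduced here. As a generic stand-in, the sketch below shows plain random subsampling of reads down to a target mean depth, where mean depth is estimated as total read bases divided by genome length. The function name, the target depth, and the synthetic reads are assumptions for illustration; the genome length of 3200 bp approximates the size of the HBV genome.

```python
import random


def subsample_to_depth(reads, genome_length, target_depth, seed=0):
    """Randomly keep enough reads to approximate a target mean depth."""
    total_bases = sum(len(r) for r in reads)
    current_depth = total_bases / genome_length
    if current_depth <= target_depth:
        # Already at or below the target: keep everything.
        return list(reads)

    fraction = target_depth / current_depth
    n_keep = max(1, round(len(reads) * fraction))
    rng = random.Random(seed)  # fixed seed for reproducible subsets
    return rng.sample(list(reads), n_keep)


# Synthetic reads, for illustration only: 1000 reads of 100 bp each.
reads = ["A" * 100] * 1000
kept = subsample_to_depth(reads, genome_length=3200, target_depth=10)
print(len(kept))
```

Uniform random sampling reduces mean depth but does not even out coverage along the genome, so it should be read only as the simplest possible baseline.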
Current Scientific Enhancements to the Pipeline:
New Virus (HIV):
- To expand system capabilities
- To prepare for HIV/HBV coinfection analytics
New Analysis Requirements:
- Characterize sequence diversity of HIV
- Identify latent and emerging drug resistance
- Determine co-receptor tropism
- Screen monoclonal antibody candidates
- Quasispecies diversity and pathogenesis
- Predict disease severity and progression markers
New Training Models to:
- Learn HBV and HIV antiviral drug resistance
- Predict ≥1 drug class, including mAb resistance
- Quasispecies timecourse
New Routines:
- Statistical evaluations of quasispecies
- Reporting metrics
- Improved ‘permissive’ assembly
- Recognize outputs from numerous sequencing technologies (Illumina, Nanopore, PacBio, etc.)
Accessibility:
- Cloud-based SaaS
- Local installations
- API for integration/interfacing with existing pipelines