A hybrid de novo viral assembly system

Background

Within an infected host, a virus exists as a quasispecies: an intricate network of immense numbers of highly similar yet distinct variant genomes. Collectively, these variants form a pool of adaptable mutants from which pathogenesis-driving phenotypes may arise.
Quasispecies are more than a fascinating phenomenon; they have profound implications for therapeutic interventions, clinical outcomes, and the design of future drugs. The ability to analyze these networks, decode their compositions, and anticipate their future trajectories is therefore essential.
This is where GATACA’s GAT/ML™ takes center stage. We go beyond just examining these quasispecies distributions; we chart their futures.

Challenges

De novo assembly of NGS reads is currently required

High haplotype similarity and the fleeting nature of quasispecies (QS) challenge current algorithms.

Only rough approximations of the quasispecies or incomplete solutions are possible

Current tools are computationally unwieldy and produce spurious results.

Currently, virologists must choose between accuracy and computational efficiency

Improved analytics and performance require specialized algorithm training.

Our Solution

A novel synergistic framework combining machine learning (ML) algorithms and de novo assembly

‘GAT/ML’

pipeline performance
Assembly Quality, Patient Data

The metrics on which GAT/ML is superior are those most crucial for bridging the common trade-off between precision and recall in haplotype reconstruction.

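Contig quality metrics on two patient datasets:

| Sample      | Mode                       | Largest contig (bp) | N50 (bp) | # contigs | # contigs >1000bp |
|-------------|----------------------------|---------------------|----------|-----------|-------------------|
| SRR12712485 | Popular viral assembler #1 | 1855                | 1179     | 575       | 2                 |
| SRR12712485 | GAT/ML                     | 3285                | 3285     | 4         | 3                 |
| SRR12712485 | Popular viral assembler #2 | 1476                | 1461     | 41        | 2                 |
| SRR12712485 | GAT/ML                     | 2934                | 1391     | 7         | 4                 |
| SRR12712486 | Popular viral assembler #1 | 2314                | 2314     | 140       | 2                 |
| SRR12712486 | GAT/ML                     | 3106                | 1806     | 4         | 4                 |
| SRR12712486 | Popular viral assembler #2 | 2101                | 2101     | 2         | 2                 |
| SRR12712486 | GAT/ML                     | 3316                | 3316     | 1         | 1                 |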
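For readers unfamiliar with the contig metrics above, the following is a minimal sketch of how they are conventionally computed from a list of contig lengths. It assumes the standard N50 definition (the length of the shortest contig in the smallest set of longest contigs that covers at least half of the total assembled length). The function name `contig_stats` and the example lengths are hypothetical and are not taken from the GAT/ML pipeline or the datasets above.

```python
def contig_stats(lengths, min_len=1000):
    """Summarise contig lengths (in bp) with four common assembly metrics."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2

    # Walk from the longest contig down until half of the assembly is covered.
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if running >= half_total:
            n50 = length
            break

    return {
        "largest_contig": ordered[0],
        "n50": n50,
        "num_contigs": len(ordered),
        "num_contigs_over_min_len": sum(1 for x in ordered if x > min_len),
    }


# Hypothetical contig lengths, for illustration only.
example = contig_stats([3200, 1500, 800, 300])
print(example)
```

A fragmented assembly with hundreds of short contigs tends to score a low N50 and a high contig count, while a small number of near full-length contigs pushes N50 toward the largest contig length.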

Assembly Improvements, Across Experiments

GAT/ML demonstrated an unanticipated ability to assist competitor assemblers by ‘rescuing’ their failed assembly results, producing long, accurate contigs from highly fragmented outputs. As shown above, the number of full-length HBV contigs (>3kb) increases when GAT’s ML network is applied, across all failed assemblies and datasets. Most of the competitor assemblies produced enormous populations of short false-positive contig fragments, which GAT/ML lengthened and reduced in number.

Assembly Rescue, In Vitro and Patient Data

GAT/ML’s ability to rescue failed assemblies is especially evident on in vitro and patient data, which tax and confuse competitor assemblers (shown below).

Assembly Sensitivity, In Vitro Data

In a replicated experiment, GAT/ML recovered two full-length HBV haplotypes from NGS samples of an in vitro cultured mixture of two nearly identical strains (99.9% identical), differing by a single nucleotide polymorphism (SNP).
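To make the notion of "nearly identical" haplotypes concrete, here is a minimal sketch that locates SNPs and computes percent identity between two sequences. It assumes the two sequences are already aligned and of equal length, with no insertions or deletions. The helper names and the toy sequences are hypothetical and far shorter than a real HBV genome.

```python
def snp_positions(seq_a, seq_b):
    """Return 0-based positions where two pre-aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions that are identical."""
    mismatches = len(snp_positions(seq_a, seq_b))
    return 100.0 * (len(seq_a) - mismatches) / len(seq_a)


# Toy 20-base haplotypes that differ at a single position (index 7).
hap_1 = "ACGTACGTACGTACGTACGT"
hap_2 = "ACGTACGCACGTACGTACGT"

print(snp_positions(hap_1, hap_2))
print(percent_identity(hap_1, hap_2))
```

The longer the genome, the closer a single SNP pushes identity toward 100%, which is why reads from two such strains are so easily collapsed into one consensus by an assembler.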

Assembly Efficiency, Across Experiments

GAT/ML is more complex than other viral and non-viral assembly tools, yet this complexity does not translate into slower speed (longer wait times) or greater memory usage. As shown, there is no appreciable slowdown from assembly alone (blue dots), to assembly with the ML modules added (green dots), to the further addition of the GAT/ML scaffolding/second assembly round (purple dots).

Assembly Accuracy, Patient Data

GAT/ML significantly improved de novo assembly accuracy, most notably on complex, real-world clinical datasets.

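| Dataset                                            | Target      | Number of samples | Exact matches | Exact matches (%) | Precision | *Recall | F-1  |
|----------------------------------------------------|-------------|-------------------|---------------|-------------------|-----------|---------|------|
| Private patient dataset #1, LOBP                   | Genotype    | 52                | 51            | 98%               | 100%      | 98%     | 99%  |
| Private patient dataset #2, LOBP                   | Subgenotype | 22                | 22            | 100%              | 100%      | 100%    | 100% |
| SRA dataset, NIH                                   | Genotype    | 22                | 18            | 82%               | 86%       | 82%     | 83%  |
| Private patient dataset, Universitari Vall d’Hebron | Genotype    | 44                | 33            | 75%               | 91%       | 75%     | 81%  |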
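The table does not state how precision, recall, and F-1 were averaged across classes. In every row the recall column equals the exact-match percentage, which is what support-weighted averaging produces, so the sketch below assumes that scheme. This is an assumption, not a confirmed description of the evaluation. The function name `weighted_scores` and the toy genotype labels are hypothetical.

```python
from collections import Counter


def weighted_scores(actual, predicted):
    """Support-weighted precision, recall and F1 for multi-class labels."""
    n = len(actual)
    support = Counter(actual)          # number of true samples per class
    predicted_count = Counter(predicted)
    true_pos = Counter(a for a, p in zip(actual, predicted) if a == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = true_pos[label]
        p = tp / predicted_count[label] if predicted_count[label] else 0.0
        r = tp / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n             # weight each class by its support
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {
        "exact_match": sum(true_pos.values()) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy genotype calls for six samples; not drawn from the datasets above.
actual = ["A", "A", "A", "D", "D", "C"]
predicted = ["A", "A", "D", "D", "D", "C"]
scores = weighted_scores(actual, predicted)
print(scores)
```

Under this averaging, weighted recall always equals the exact-match rate, and the weighted F1 is an average of per-class F1 values rather than the harmonic mean of the two reported averages.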
pipeline component performance

ML-1 Genotypes

The ML-1 module is a read binner that pre-processes the NGS reads prior to assembly. Similarly clustered reads are segregated into variation categories, here by HBV genotype. All data are verified using strong bootstrap support and monophyletic clustering.
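The source does not describe the model inside ML-1. Purely as an illustration of what pre-assembly read binning can look like, the sketch below swaps in a simple stand-in technique: a nearest-centroid classifier over 3-mer frequency profiles. Every name here (`kmer_profile`, `train`, `bin_read`), the choice of k, and the toy reads and labels are hypothetical assumptions, not GAT/ML internals.

```python
import math
from collections import Counter
from itertools import product

K = 3  # assumed k-mer size, chosen only to keep the example small
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_profile(seq, k=K):
    """Normalised k-mer frequency vector for one read."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in KMERS) or 1
    return [counts[km] / total for km in KMERS]


def train(labelled_reads):
    """Average the k-mer profiles of the training reads in each bin."""
    by_label = {}
    for label, seq in labelled_reads:
        by_label.setdefault(label, []).append(kmer_profile(seq))
    return {
        label: [sum(col) / len(profiles) for col in zip(*profiles)]
        for label, profiles in by_label.items()
    }


def bin_read(seq, centroids):
    """Assign a read to the bin whose centroid is nearest (Euclidean)."""
    profile = kmer_profile(seq)
    return min(centroids, key=lambda label: math.dist(profile, centroids[label]))


# Toy training reads with made-up bin labels "X" and "Y".
training = [
    ("X", "ATATATATTAAT"),
    ("X", "TTATAATATATA"),
    ("Y", "GCGCGGCCGCGC"),
    ("Y", "CCGCGGGCGCCG"),
]
centroids = train(training)
print(bin_read("ATTATATAAT", centroids))
print(bin_read("GGCGCCGCGG", centroids))
```

Reads binned this way could then be assembled per bin, so that each assembly run sees a less mixed population of variants.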

ML-1 Subgenotypes

Refined training on HBV subgenotypes enables ML-1 to bin NGS reads into HBV subgenotypes.

ML-2 Contig Fingerprinting

GAT’s ML-2 module follows assembly, bins the contigs into similar clusters, and then refines or “fingerprints” the contigs with defining features predictive of their classification.

Confusion matrix of HBV genotype predictions showing actual vs. predicted bin classes (left), and scatter plot (right) showing the position of each contig in two of the 13 latent dimensions identified by GAT/ML’s contig fingerprinting module. In just two dimensions, the contigs of each genotype occupy distinct, well-defined regions.
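For reference, a confusion matrix of the kind described in the caption can be tallied as in the minimal sketch below, with rows for the actual class and columns for the predicted class. The function name and the toy genotype calls are hypothetical; they do not reproduce the figure's data.

```python
def confusion_matrix(actual, predicted, labels):
    """Count (actual, predicted) pairs; rows are actual, columns predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Toy contig genotype calls, for illustration only.
labels = ["A", "C", "D"]
actual_bins = ["A", "A", "C", "D", "D", "D"]
predicted_bins = ["A", "C", "C", "D", "D", "A"]

matrix = confusion_matrix(actual_bins, predicted_bins, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

Counts on the diagonal are correctly binned contigs; any off-diagonal count shows which pair of classes is being confused.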

Conclusions

Current Scientific Enhancements to the Pipeline:

New Virus (HIV):
  • To expand system capabilities
  • To prepare for HIV/HBV coinfection analytics
New Analysis Requirements:
New Training Models to:
  • Learn HBV and HIV antiviral drug resistance
  • Predict resistance to ≥1 drug class, including mAb resistance
  • Model quasispecies time courses
New Routines:
  • Statistical evaluations of quasispecies
  • Reporting metrics
  • Improved ‘permissive’ assembly
  • Recognize outputs from numerous sequencing technologies (Illumina, Nanopore, PacBio, etc.)
Accessibility:
  • Cloud-based SaaS
  • Local installations
  • API for integration/interfacing with existing pipelines