A hybrid de novo viral assembly system

Background

Within an infected host, a virus exists as a quasispecies: an immense population of highly similar yet distinct variant viral genomes. Collectively, these variants form a pool of adaptable mutant forms from which pathogenesis-driving phenotypes may arise.
Quasispecies have profound implications for therapeutic interventions, clinical outcomes, and the design of future drugs. The ability to analyze these populations, decode their composition, and anticipate their future trajectories is therefore essential.
This is where GATACA’s GAT/ML™ takes center stage. It goes beyond examining quasispecies distributions to chart their futures.

Challenges

De novo assembly of NGS reads is currently required

High haplotype similarity and the fleeting nature of quasispecies (QS) challenge current algorithms.

Only rough approximations of the quasispecies, or incomplete solutions, are possible

Current tools are computationally unwieldy and produce spurious results.

Currently, virologists must choose between accuracy and computational efficiency

Improved analytics and performance require specialized algorithm training.

Our Solution

A novel synergistic framework combining machine learning (ML) algorithms and de novo assembly

‘GAT/ML’

pipeline performance

Assembly Quality, Patient Data

GAT/ML is superior on the metrics most crucial for bridging the common trade-off between precision and recall in haplotype reconstruction.

Contig quality metrics on two patient datasets:

Assembler	Largest contig (bp)	N50 (bp)	# contigs	# contigs >1000 bp
Sample # SRR12712485
Popular viral assembler #1	1855	1179	575	2
GAT/ML	3285	3285	4	3
Popular viral assembler #2	1476	1461	41	2
GAT/ML	2934	1391	7	4
Sample # SRR12712486
Popular viral assembler #1	2314	2314	140	2
GAT/ML	3106	1806	4	4
Popular viral assembler #2	2101	2101	2	2
GAT/ML	3316	3316	1	1
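
For readers unfamiliar with the contig metrics in the table, the following is a minimal sketch of how they are conventionally computed from a list of contig lengths. The function names and the example lengths are illustrative assumptions; they are not GAT/ML code and are not taken from the patient datasets.

```python
def n50(lengths):
    """Return the N50: the length of the contig at which the running
    total of contigs, sorted longest first, reaches half of all bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def summarize(lengths, min_len=1000):
    """Compute the four columns reported in the contig quality table."""
    return {
        "largest_contig": max(lengths),
        "n50": n50(lengths),
        "num_contigs": len(lengths),
        "num_contigs_over_min": sum(1 for n in lengths if n > min_len),
    }


# Hypothetical contig lengths in bp (assumption, for illustration only).
example = [3200, 1500, 900, 400]
print(summarize(example))
```

A fragmented assembly with hundreds of short contigs drives the contig count up and N50 down, which is why the two are usually read together.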

pipeline performance

Assembly Improvements, Across Experiments

GAT/ML demonstrated an unanticipated ability to assist competitor assemblers by ‘rescuing’ their failed assembly results, producing long, accurate contigs from highly fragmented outputs. As shown above, the number of full-length HBV contigs (>3 kb) increases when GAT’s ML network is applied, across all failed assemblies and datasets. Most of the competitor assemblies produced enormous populations of short, false-positive contig fragments, which GAT/ML lengthened and reduced in number.

pipeline performance

Assembly Rescue, In Vitro and Patient Data

GAT/ML’s ability to rescue failed assemblies is especially pronounced for in vitro and patient data, which tax and confuse competitor assemblers (shown below).

pipeline performance

Assembly Sensitivity, In Vitro Data

In a replicated experiment, GAT/ML recovered two full-length HBV haplotypes from NGS samples of an in vitro cultured mixture of two nearly identical strains (99.9% identical) that differ by a single nucleotide polymorphism (SNP).
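
To give a sense of how close two such strains are, here is a minimal sketch that computes percent identity between two aligned, equal-length sequences. The 3,200 bp genome length is an assumption (roughly the size of an HBV genome), and the sequences are synthetic placeholders, not the cultured strains.

```python
def percent_identity(seq_a, seq_b):
    """Percent of positions that match between two aligned sequences
    of equal length (no gaps are handled in this sketch)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return 100.0 * (len(seq_a) - mismatches) / len(seq_a)


# Assumption: a genome of about 3,200 bp, with synthetic sequence content.
genome_length = 3200
strain_a = "A" * genome_length
# Introduce exactly one substitution (one SNP) at an arbitrary position.
strain_b = strain_a[:1000] + "G" + strain_a[1001:]

print(round(percent_identity(strain_a, strain_b), 3))
```

Under that assumed length, a single SNP leaves the two genomes more than 99.9% identical, so nearly every read is compatible with both haplotypes.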

pipeline performance

Assembly Efficiency, Across Experiments

GAT/ML is more complex than other viral and non-viral assembly tools, yet this complexity does not translate into slower speed (longer wait times) or greater memory usage. As shown, there is no appreciable loss of speed in going from assembly alone (blue dots), to assembly with the ML modules added (green dots), to the full pipeline with the GAT/ML scaffolding/second assembly round (purple dots).

pipeline performance

Assembly Accuracy, Patient Data

GAT/ML significantly improved de novo assembly accuracy, most notably on complex, real-world clinical datasets.

Dataset	Target	Number of samples	Exact matches	Exact matches (%)	Precision	Recall	F1
Private patient dataset #1, LOBP	Genotype	52	51	98%	100%	98%	99%
Private patient dataset #2, LOBP	Subgenotype	22	22	100%	100%	100%	100%
SRA dataset, NIH	Genotype	22	18	82%	86%	82%	83%
Private patient dataset, Universitari Vall d’Hebron	Genotype	44	33	75%	91%	75%	81%
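
The relationship between the columns can be sketched as follows, using the first row of the table as input. This is a minimal illustration with assumed function names. The table does not state how precision, recall, and F1 were aggregated across classes, so recomputing F1 from the rounded values in other rows may differ from the reported figure by a point.

```python
def exact_match_rate(exact_matches, num_samples):
    """Fraction of samples whose predicted label exactly matches the truth."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    return exact_matches / num_samples


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# First row of the table: 51 exact matches out of 52 samples, precision 100%.
rate = exact_match_rate(51, 52)
f1 = f1_score(1.0, rate)
print(f"exact match rate: {rate:.0%}, F1: {f1:.0%}")
```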

pipeline component performance

ML-1 Genotypes

The ML-1 module is a read binner that pre-processes the NGS reads prior to assembly. Reads that cluster together are segregated into variation categories, here by HBV genotype. All data are verified using strong bootstrap support and monophyletic clustering.
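
The general idea of binning reads before assembly can be sketched with a simple k-mer profile comparison. This is not the ML-1 model, whose internals are not described here; it is a generic nearest-reference stand-in, and the reference sequences, bin labels, and function names are all hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k=3):
    """Count every overlapping k-mer in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(p, q):
    """Cosine similarity between two k-mer count profiles."""
    dot = sum(count * q[kmer] for kmer, count in p.items() if kmer in q)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def bin_read(read, references, k=3):
    """Assign a read to the reference bin with the most similar k-mer profile."""
    profile = kmer_profile(read, k)
    return max(
        references,
        key=lambda label: cosine(profile, kmer_profile(references[label], k)),
    )


# Hypothetical toy references standing in for per-genotype training data.
references = {
    "bin_A": "ACGTACGTACGTACGT",
    "bin_B": "TTTTGGGGTTTTGGGG",
}

for read in ["ACGTACGA", "TTGGGGTT"]:
    print(read, "->", bin_read(read, references))
```

Each bin can then be assembled separately, which is the sense in which binning focuses the assembly.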

pipeline component performance

ML-1 Subgenotypes

Refined training on HBV subtypes enables ML-1 to bin NGS reads into HBV subgenotypes.

pipeline component performance

ML-2 Contig Fingerprinting

GAT’s ML-2 module follows assembly, bins the contigs into similar clusters, and then refines or “fingerprints” the contigs with defining features predictive of their classification.

Confusion matrix of HBV genotype predictions showing actual vs. predicted bin classes (left), and scatter plot (right) showing the position of each contig in two of the 13 latent dimensions identified by GAT/ML’s contig fingerprinting module. In just two dimensions, the contigs of each genotype occupy distinct, well-defined regions.
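
For reference, a confusion matrix of actual vs. predicted bin classes can be tabulated as in the minimal sketch below. The labels and predictions are hypothetical placeholders, not results from the ML-2 module.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Return (labels, matrix), where matrix[i][j] counts items whose
    actual class is labels[i] and whose predicted class is labels[j]."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


# Hypothetical genotype calls for five contigs (assumption, for illustration).
actual = ["A", "A", "B", "C", "C"]
predicted = ["A", "B", "B", "C", "C"]

labels, matrix = confusion_matrix(actual, predicted)
print("predicted:", labels)
for label, row in zip(labels, matrix):
    print("actual", label, row)
```

Counts on the diagonal are correct predictions; off-diagonal counts show which classes are confused with one another.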

pipeline performance

Conclusions

Assembly alone has reached the limit of its ability to overcome technical obstacles. Specialized training is required.
Our novel design using ML and NLP accurately and efficiently reconstructed full-length HBV variant genomes, including highly similar sequences derived from cell cultures.
ML focused the assembly of NGS samples; superior performance was observed in metrics most crucial for bridging the common trade-off of precision vs. recall.
A ‘Goldilocks’ region for sequencing depth was found, and a proprietary subsampling approach was used to reduce coverage problems.
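
The subsampling approach itself is proprietary and not described here. As a generic illustration only, the sketch below shows plain random subsampling of reads down to a target mean depth, where mean depth is total sequenced bases divided by genome length. The target depth, genome length, and function names are assumptions.

```python
import random


def mean_depth(reads, genome_length):
    """Mean sequencing depth: total sequenced bases / genome length."""
    return sum(len(read) for read in reads) / genome_length


def subsample_to_depth(reads, target_depth, genome_length, seed=0):
    """Randomly keep enough reads to bring mean depth down to the target.
    Returns all reads unchanged if depth is already at or below the target."""
    reads = list(reads)
    current = mean_depth(reads, genome_length)
    if current <= target_depth:
        return reads
    keep = max(1, round(len(reads) * target_depth / current))
    # A seeded generator keeps the subsample reproducible between runs.
    return random.Random(seed).sample(reads, keep)


# Hypothetical input: 1,000 synthetic 100 bp reads over a 1,000 bp genome.
reads = ["A" * 100 for _ in range(1000)]
subset = subsample_to_depth(reads, target_depth=10, genome_length=1000)
print(mean_depth(reads, 1000), "->", mean_depth(subset, 1000))
```

Uniform random subsampling treats every read alike; an approach tuned to a specific depth window would presumably be more selective.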

Current Scientific Enhancements to the Pipeline:

New Virus (HIV):

To expand system capabilities
To prepare for HIV/HBV coinfection analytics

New Analysis Requirements:

Identify latent and emerging drug resistance
Determine co-receptor tropism
Screen monoclonal antibody candidates
Characterize quasispecies diversity and pathogenesis
Predict disease severity and progression markers

New Training Models to:

Learn HBV and HIV antiviral drug resistance
Predict resistance to ≥1 drug class, including mAb resistance
Model quasispecies time courses

New Routines:

Statistical evaluations of quasispecies
Reporting metrics
Improved ‘permissive’ assembly
Recognize output from numerous sequencing technologies (Illumina, Nanopore, PacBio, etc.)

Accessibility:

Cloud-based SaaS
Local installations
API for integration/interfacing with existing pipelines
