A Conditional Random Field approach for de novo reconstruction of bacterial haplotypes from a de Bruijn graph representation
Steyaert, A.; Van Hecke, M.; Marchal, K.; Fostier, J.
Abstract
Background: Detecting distinct bacterial strains in a mixed sample is an important, yet less well-developed, aspect of metagenomic research. Several methods exist that successfully retrieve a de novo reconstruction of viral strains. However, the reconstruction of bacterial haplotypes poses its own distinct challenges, and methods that successfully reconstruct full genome-length bacterial strains de novo are scarce. Here, we develop HaploDetox, a method for de novo bacterial haplotype reconstruction from short reads. We use a de Bruijn graph representation of the reads in which nodes correspond to k-mers from the read set and arcs represent the overlap between the sequences of two nodes. Our aim is to accurately assign a label to each node and arc in the graph that reveals the presence or absence of its corresponding sequence in each individual strain.
Results: Using a negative binomial mixture model, we model the relationship between the read coverage of nodes and arcs in the graph and their presence in a strain. We achieve improved labelling accuracy by including contextual information from neighbouring nodes and arcs with a Conditional Random Field. These labels are used to extract strain-specific de Bruijn graphs from the original graph. Additionally, we allow users to assess the number of strains present in the dataset based on model selection criteria. We evaluate our node/arc labelling accuracy on simulated datasets and in silico mixes of real datasets containing different numbers of strains, as well as on in vitro mixed real datasets. Existing de novo haplotype reconstruction methods present their reconstruction as strain-specific sets of SNPs. By aligning the unitigs from strain-specific graphs to a reference genome, we demonstrate that HaploDetox assigns strain-specific SNPs with higher recall than existing methods and with similar precision.
Conclusions: We achieve improved strain-specific SNP phasing accuracy as compared to existing methods for de novo bacterial haplotype reconstruction. Additionally, HaploDetox is not limited to the determination of strain-specific SNPs, and other types of variant calls can be obtained through reference alignment. Finally, strain-specific de Bruijn graphs are an important first step towards full genome-length bacterial haplotype-aware assembly.
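To make the labelling idea concrete, the following minimal Python sketch counts k-mer (node) and (k+1)-mer (arc) coverage from reads and then scores every strain presence/absence pattern for a single node under a negative binomial model. It is an illustration under stated assumptions, not the HaploDetox implementation: the function names, the uniform prior over patterns, the fixed per-strain coverages, the error mean `err_mean` and the dispersion `size` are all hypothetical, reverse complements are ignored, and the Conditional Random Field step that adds neighbourhood context is omitted.

```python
import math
from collections import Counter
from itertools import product


def build_de_bruijn(reads, k):
    """Count node (k-mer) and arc ((k+1)-mer) coverage; forward strand only."""
    node_cov, arc_cov = Counter(), Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            node_cov[read[i:i + k]] += 1
        for i in range(len(read) - k):
            kp1 = read[i:i + k + 1]
            # An arc joins two k-mers that overlap in k-1 characters.
            arc_cov[(kp1[:-1], kp1[1:])] += 1
    return node_cov, arc_cov


def nb_log_pmf(x, mean, size):
    """Log pmf of a negative binomial with the given mean and
    variance mean + mean**2 / size."""
    return (math.lgamma(x + size) - math.lgamma(size) - math.lgamma(x + 1)
            + size * math.log(size / (size + mean))
            + x * math.log(mean / (size + mean)))


def label_posteriors(cov, strain_cov, err_mean=1.0, size=20.0):
    """Posterior over presence patterns (one 0/1 flag per strain) for a node
    with observed coverage `cov`, assuming a uniform prior over patterns.

    The expected coverage of a pattern is the summed coverage of the strains
    flagged as present; the all-absent pattern models sequencing errors with
    a small mean (`err_mean`, an assumed value).
    """
    log_w = {}
    for pattern in product((0, 1), repeat=len(strain_cov)):
        mean = sum(c for c, present in zip(strain_cov, pattern) if present)
        log_w[pattern] = nb_log_pmf(cov, mean if mean > 0 else err_mean, size)
    top = max(log_w.values())  # subtract the maximum for numerical stability
    weights = {p: math.exp(v - top) for p, v in log_w.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


def best_label(cov, strain_cov):
    """Return the most probable presence pattern for one node or arc."""
    post = label_posteriors(cov, strain_cov)
    return max(post, key=post.get)


if __name__ == "__main__":
    nodes, arcs = build_de_bruijn(["ACGTA", "CGTAC"], k=3)
    print(dict(nodes), dict(arcs))
    # Two hypothetical strains sequenced at 30x and 70x average coverage.
    for cov in (1, 31, 68, 105):
        print(cov, best_label(cov, (30, 70)))
```

Each node is scored here in isolation, so a node whose coverage falls between two expected values can easily be mislabelled. That ambiguity is what the contextual information from neighbouring nodes and arcs is meant to reduce.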