Petabase-scale sequence alignment catalyses viral discovery. Edgar et al., Nature, January 2022.
Public sequence databases (NCBI, EBI) contain over 20 petabases and are growing exponentially.
Peta is 10^15, so 1 petasecond is 31.7 million years and 1 light year is 9.461 petametres.
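The two "peta" comparisons above can be checked with a few lines of arithmetic. This is only a quick sketch: the Julian year of 365.25 days is my assumption for the length of a year, and the variable names are made up for illustration.

```python
# Worked check of the "peta" examples (peta = 10**15).
PETA = 10**15

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year (an assumption)
LIGHT_SPEED_M_S = 299_792_458          # metres per second, exact by SI definition

# How many years fit into one petasecond?
petasecond_in_years = PETA / SECONDS_PER_YEAR

# How many petametres does light travel in one year?
light_year_in_petametres = LIGHT_SPEED_M_S * SECONDS_PER_YEAR / PETA

print(f"1 petasecond = {petasecond_in_years / 1e6:.1f} million years")
print(f"1 light year = {light_year_in_petametres:.3f} petametres")
```

Both figures round to the values quoted above (31.7 million years and 9.461 petametres).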
The sequences are filed as evidence supporting publications and as outputs from large-scale projects. Because of their sheer volume, they are hard to search efficiently. This paper presents open-source tools that perform sequence alignment using cloud computing. The authors looked for evidence of previously unidentified RNA viruses and identified over 100,000 novel viruses, increasing the number already known roughly tenfold. Expanding the known sequence diversity of viruses can reveal the evolutionary origins of emerging pathogens and improve pathogen surveillance for the anticipation and mitigation of future pandemics.
Old-school methods for finding viruses usually involved growing them in suitable hosts such as chick embryos or cell culture, or in some cases examining infected material by electron microscopy. Recent methods have switched to nucleic acid sequencing, since it has increased in scale and fallen in cost. Sequence data are stored as text or binary files, and a large quantity is publicly available. Sequences are identified and characterised by aligning them against a reference sequence.
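To make "aligning against a reference sequence" concrete, here is a minimal sketch of local alignment scoring in the style of Smith-Waterman. It is a toy illustration, not the method used in the paper, and the scoring values, the function name and the example sequences are all my own assumptions.

```python
def local_alignment_score(query, reference, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score.

    Scores are illustrative: +2 per match, -1 per mismatch, -2 per gap.
    Only the previous row of the scoring matrix is kept to save memory.
    """
    prev = [0] * (len(reference) + 1)
    best = 0
    for q in query:
        curr = [0]  # column 0 is always 0 in local alignment
        for j, r in enumerate(reference, start=1):
            diag = prev[j - 1] + (match if q == r else mismatch)
            up = prev[j] + gap        # gap in the reference
            left = curr[j - 1] + gap  # gap in the query
            score = max(0, diag, up, left)  # 0 lets the alignment restart
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


reference = "ACGTTGACCTGA"  # made-up reference sequence
read_hit = "TTGACC"         # made-up read that occurs in the reference
read_miss = "GGGGGG"        # made-up read that does not

print(local_alignment_score(read_hit, reference))
print(local_alignment_score(read_miss, reference))
```

A read that matches part of the reference scores much higher than an unrelated one, and that difference is what an aligner uses to decide where, or whether, a read belongs. Production tools rely on indexing and heuristics to avoid filling in this full matrix for every read.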
“Accessing the planetary virome”
For their source of sequences they used the Sequence Read Archive (SRA) at NCBI (National Center for Biotechnology Information, Bethesda, Maryland). They used DIAMOND, a software tool that harnesses supercomputing to perform tree-of-life scale protein alignments in hours, while matching the sensitivity of the gold standard BLASTP.
What did they use as the template to search? One target was the RNA-dependent RNA polymerase (RdRP), which has a well-conserved amino acid subsequence referred to as a “palmprint”.
3,376,880 (59.38%) sequencing runs contained one or more reads that mapped to the RdRP query. These were assembled and grouped into 15,016 known species-like “operational taxonomic units” (sOTUs, roughly “viral species”) and 131,957 novel sOTUs, representing an increase in the number of known RNA viruses by a factor of 9.8. It is likely that this type of search will miss many highly diverged or “dark” viruses, and the authors estimate that at most 0.1% of the global virome has been captured so far. However, if exponential data growth combined with increased search sensitivity continues, we are at the cusp of identifying a notable fraction of Earth’s total genetic diversity.
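The factor of 9.8 follows directly from the two sOTU counts, and the percentage implies roughly how many runs were searched in total. A quick sketch of that arithmetic, where the variable names are mine and the implied total is only approximate because the percentage is rounded:

```python
# Counts quoted in the text above.
known_sotus = 15_016
novel_sotus = 131_957
runs_with_hits = 3_376_880
fraction_with_hits = 0.5938  # 59.38%, rounded in the source

# Fold increase = (known + novel) / known.
fold_increase = (known_sotus + novel_sotus) / known_sotus

# Total runs searched, back-calculated from the rounded percentage.
implied_runs_searched = runs_with_hits / fraction_with_hits

print(f"Fold increase in known RNA viruses: {fold_increase:.1f}")
print(f"Implied runs searched: about {implied_runs_searched / 1e6:.2f} million")
```

So the search covered on the order of 5.7 million sequencing runs, and the novel sOTUs outnumber the known ones by almost nine to one.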
Human population growth and encroachment on animal habitats are bringing more species into proximity, leading to an increased rate of zoonosis and accelerating the Anthropocene mass extinction. Thus, investment in the collection and curation of biologically diverse samples, with an emphasis on geographically underrepresented regions, has never been more pressing: if not for the conservation of endangered species, then to better conserve our own.