SANTA CRUZ — UC Santa Cruz researchers have compiled more than 10 million genetic variants of the COVID-19 virus from around the world and organized them into a family tree that maps the evolution of the coronavirus.
“This scale of data is really unprecedented,” said Angie Hinrichs, a bioinformatics programmer at UC Santa Cruz. “We’ve never had so many genomes from the same species before.”
In February 2020, after the first coronavirus strain became available, UC Santa Cruz researchers adapted the design of their existing genome browser and created one specifically to hold and display the collected genetic variants of the COVID-19 virus. Once the browser was built, coronavirus sequences flooded in daily by the thousands from researchers both in the U.S. and internationally. UC Santa Cruz researchers found themselves struggling to manage the massive influx of data.
“The tools that were available for building phylogenetic trees before the pandemic could handle a few thousand genome sequences, but suddenly we had tens of thousands,” said Hinrichs, who has worked with the university’s genome browser for more than 20 years.
With that in mind, a team of researchers, including Hinrichs, was assembled to wrangle the enormous volume of information into a phylogenetic tree, which works like a family tree of the virus. A member of the team, Yatish Turakhia, then a postdoctoral scholar, wrote a new software program called UShER, which allowed scientists to organize the coronavirus variants rapidly and accurately on the massive phylogenetic tree stored on the university’s coronavirus-specific browser. The number of variants in the database surpassed 10 million in June.
For comparison, the species with the next-highest number of collected sequences in the UC Santa Cruz genome browser is E. coli, with just over 5 million genomic sequences, about half the number of collected coronavirus strains. Hinrichs points out that E. coli has been studied by scientists for decades.
The university’s coronavirus genome browser and the phylogenetic tree it hosts have allowed scientists to track the history of the virus as it spreads geographically; identify new lineages and deadly variants like omicron, labeled BA.1 and BA.2 on the tree; and predict superspreader events or other potentially dangerous phenomena that can be read, like tea leaves, in the phylogenetic tree as it continues to grow.
“I wish the pandemic would go away and the sequences would stop flowing in, and I could put the wraps on this project, but it hasn’t happened yet,” said Hinrichs. “As long as the virus keeps evolving, we’ll keep building onto this tree so that we can at least understand what it’s doing.”