Wellcome Open Research

Helping to understand life on Earth through a new genomics search engine

In this blog post, Wellcome Open Research chats to Dr. Richard Challis (Wellcome Sanger Institute) about his recent Software Tool Article, which introduces a new search engine for genomic and sequencing project metadata across the full eukaryotic tree of life.

First, let’s meet the author

Dr. Rich Challis is a Senior Bioinformatician in Evolutionary Genomics, working on the Tree of Life Programme at the Wellcome Sanger Institute. As the lead developer of two genome databasing projects, he is interested in developing tools and software to explore and assess genomic data more easily.

So, tell us, how did this Software Tool, Genomes on a Tree (GoaT), come to fruition?

Through the Darwin Tree of Life (DToL) project covering Britain and Ireland, and globally as part of the broader Earth BioGenome Project (EBP), we’re working towards generating chromosomal genome assemblies of all described eukaryotic species on Earth.

For these projects to succeed, we need to choose what to sequence in each phase, know how much raw data to generate, and have mechanisms to coordinate efforts among the various projects working under the EBP umbrella.

As head of DToL, Mark Blaxter spent a lot of time at the start of the project identifying sources of genomic metadata and pulling together target lists. At around the time he was realising the need to pull these data together into a single database, I was exploring the use of Elasticsearch to index taxonomic and assembly metadata alongside genomic features to bring another project (GenomeHubs) up to date.

When Mark Blaxter and Dr. Sujai Kumar, a fellow Senior Bioinformatician with the Tree of Life, described the plans for a database of taxonomic metadata, it seemed like a perfect test case for the indexing that I was beginning to work on.

Can you give us an overview of the software?

GoaT is a searchable datastore of genomic and sequencing project metadata across all known eukaryotes.

The most prominent output from the project is the GoaT web user interface (UI) hosted at goat.genomehubs.org, which provides an interface to find any metadata (e.g., estimated genome size) for any taxon at any rank from subspecies to phylum.

Search queries can be built and refined based on any combination of the attributes indexed in GoaT, so users can identify taxa that meet very specific criteria while filtering out any that are already in progress by another project.

The GoaT UI can also generate graphical summary reports, including trees, histograms, and scatter plots for any search results. This has enabled GoaT to become a live reporting interface for many EBP projects (including DToL) and for the EBP as a whole.

Similarly, the software can help to fill in the genomic gaps for some species. The raw data that we feed into GoaT are relatively sparse and biased towards coverage of economically important taxa, so to make sure GoaT is relevant to projects that are often targeting the least well-represented organisms, we fill the gaps for each attribute with estimated values based on the most closely related sister taxa with directly reported values.
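The gap-filling idea described above can be illustrated with a minimal sketch. This is not GoaT's actual implementation: the toy taxonomy, the genome sizes, the function names, the "estimated" label, and the choice of the median as the summary statistic are all assumptions made for illustration. The sketch simply looks for directly reported values within the taxon's own subtree and, if none exist, widens the search one ancestor at a time.

```python
from statistics import median

# Hypothetical toy taxonomy (child -> parent); names are illustrative only.
PARENT = {
    "species_a": "genus_x",
    "species_b": "genus_x",
    "species_c": "genus_y",
    "genus_x": "family_z",
    "genus_y": "family_z",
    "family_z": None,
}

# Hypothetical directly reported genome sizes (Mb) for a subset of taxa.
DIRECT = {"species_a": 400.0, "species_c": 900.0}


def descendants(taxon, parent=PARENT):
    """Return the taxon itself plus every taxon beneath it."""
    found = [taxon]
    for child, par in parent.items():
        if par == taxon:
            found.extend(descendants(child, parent))
    return found


def estimate(taxon, direct=DIRECT, parent=PARENT):
    """Return (value, source); source is 'direct', 'estimated' or 'missing'."""
    if taxon in direct:
        return direct[taxon], "direct"
    node = taxon
    while node is not None:
        # Collect direct values from the closest enclosing group that has any.
        values = [direct[t] for t in descendants(node, parent) if t in direct]
        if values:
            # Assumption: summarise the related taxa with the median.
            return median(values), "estimated"
        node = parent.get(node)
    return None, "missing"


print(estimate("species_b"))
print(estimate("family_z"))
```

Keeping the source label next to each value matters in a sketch like this, because it lets a user tell a directly reported value apart from one inferred from relatives.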

What impacts do you hope this software will have for the field of genomics?

One of the major challenges in attempting to sequence all eukaryotic species on Earth is ensuring that the resources available can be used as efficiently as possible. GoaT aims to help with this by providing a public space for projects to announce their intentions and to document progress to help avoid large-scale duplication where the lists of target taxa overlap.

The information in GoaT can also be fed into sequencing and assembly pipelines where having an estimate of the expected genome size can help to determine the amount of sequencing data to generate, and knowing details such as ploidy and sex determination mechanisms can inform the assembly and curation processes.
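As a rough illustration of the genome-size point, the amount of sequencing data to generate can be approximated as the expected genome size multiplied by the desired depth of coverage. The function name and the 30x coverage figure below are assumptions chosen for the example, not values taken from any Tree of Life pipeline.

```python
def target_yield_gb(genome_size_mb: float, coverage: float) -> float:
    """Approximate sequencing yield (Gb) for a genome size (Mb) and coverage depth."""
    # Yield in Mb is genome size times coverage; divide by 1000 to get Gb.
    return genome_size_mb * coverage / 1000


# Hypothetical example: a 650 Mb estimated genome sequenced to 30x coverage.
print(target_yield_gb(650, 30))
```

An estimate that is too low leads to under-sequencing and a poorer assembly, while one that is too high wastes sequencing capacity, which is why even an inferred genome size is useful at the planning stage.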

Did you face any challenges throughout your project?

As mentioned, our aim in indexing the sparse raw data in GoaT was to provide estimates for taxa with no directly recorded values, and it quickly became apparent that we could also use GoaT to look for gaps and errors in the data.

For example, one of the more surprising things we noticed was that we had no direct records of a complete set of chromosomes for nematodes (roundworms), causing GoaT to fill in an estimated chromosome count of 32 for all nematode species. The only way around gaps like this is to collate a new resource for input to GoaT, and a manuscript on nematode chromosome counts, with values found in a literature search, is in preparation.

Another example was that early GoaT releases showed mammals as tetraploid instead of diploid and, for many taxa, filled missing entries with unusual sex systems instead of XY, because established records for mammals were missing from our source data.

This problem reflects biases towards collecting and reporting unusual values, rather than recording already established facts for many species, and catching and correcting these kinds of errors has been a key part of genomic data curation in the project.

Probably the biggest curation challenge, however, has been keeping the target lists up to date. These are the lists of species that each sequencing project declares as its targets.

To avoid duplication, GoaT needs individual projects to submit their lists as early as possible. Most projects, however, have been submitting their lists after taxon selection, when samples have already been collected, and in a few cases that has led to overlaps in target lists.

This could be avoided if projects submitted preliminary lists and used GoaT itself to refine their target taxa based on the targets declared by other affiliates.

The GoaT index is updated every day, and we are increasing our efforts to communicate that lists submitted to GoaT can be updated as needed to reflect current sequencing plans. We hope this will increase communication and coordination among sequencing initiatives.

Why did you choose to publish your work with Wellcome Open Research?

Open access code and data have been essential to the development of GoaT. GoaT has been built entirely on open-source software, and the data in GoaT have been shared openly by the original producers and data collators.

A fully open access publishing platform, therefore, is the only appropriate place to publish a resource that has been generated in this way.

Similarly, GoaT has grown out of the work at DToL where Wellcome Open Research has supported the development of the Genome Note to handle the volume of assemblies that the project is generating, with a dedicated Tree of Life Gateway for all related publications.

Having the opportunity to publish GoaT alongside these Genome Notes that have been generated through processes using data from GoaT, and that themselves now feed back into GoaT, feels like a perfect fit.

What’s next for this area of study?

We’re currently working on developing GoaT in a couple of directions.

The first is focused on working more closely with communities with interests in particular areas that overlap with the data GoaT already holds, for example supporting efforts to build or update databases on speciation and reproductive biology.

The Tree of Sex consortium is planning to use GoaT as an initial source of information to begin to reconstruct their database of reproductive biology. They will then coordinate the curation of this and additional related data to become a reliable source of metadata, creating a positive feedback loop that will increase the completeness and accuracy of data available in GoaT.

The second direction is to develop the code to support more than just taxon and assembly metadata, as the aim for DToL and related projects is to “sequence everything”.

The logical evolution of the GenomeHubs code underlying GoaT is to work towards being able to “index everything” in a way that allows researchers to ask questions across the assemblies and various annotations and comparative analyses that global sequencing projects are generating.

Read the full Software Tool Article today on Wellcome Open Research to dive deeper into the study and findings, and discover related Tree of Life research in the dedicated Tree of Life Gateway.

