Planetary-Scale Sequencing Data Analysis Surveys Life’s Diversity
Link: https://cnrs.zoom.us/j/96910012529?pwd=F6alZrWtm7aNxPj3TqTDAPPTOeOeKb.1
Speaker: Rayan Chikhi, PRAIRIE/Institut Pasteur, CNRS – Research Director, Department of Computational Biology.
Abstract
Petabytes of valuable DNA sequencing data reside in public repositories, doubling in size every two years. They contain a wealth of genetic information about viruses, bacteria, animals, and humans. We have developed two bioinformatics cloud infrastructures, named Serratus and Logan, to perform petabase-scale sequence analysis. With Serratus, we analyzed all available RNA-seq samples (5.7 million samples, 10 petabytes) and discovered ten times more RNA viruses than previously known, including a new family of coronaviruses (Edgar et al., Nature, 2022). With Logan, we are making Earth’s data more accessible by reducing its size by 100x without significant loss of information. This will enable many applications, e.g. the training of foundation models.
Bio
Rayan Chikhi is a G5 group leader at Institut Pasteur and CNRS research director in the Department of Computational Biology, Sequence Bioinformatics team, and holds a chair at the PRAIRIE institute.
His research combines algorithms, machine learning, and statistical techniques to mine large amounts of DNA sequencing data. The plan is to develop new computational methods to perform an initial analysis of raw sequencing data, and then apply supervised machine learning methods to detect clinically relevant variants. The project fosters connections among three disciplines: sequence bioinformatics, AI, and a high-profile clinical application. It is thus part of the biological and interdisciplinary side of PRAIRIE. It also tackles the analysis of ‘very big data’, as each human genome yields around 100 gigabases of raw data, and studied cohorts typically gather thousands of samples or more.