Master's degree in Engineering at Telecom Paris
Unravelling non-coding driver alterations in cancer with deep learning.
Of the 3 gigabases that constitute the human genome, only about 50 megabases (<2%) correspond to the coding sequences of protein-coding genes. Particular attention has been paid to somatic mutations affecting these coding sequences, leading to the almost exhaustive characterization of 723 genes implicated in cancer (Cancer Gene Census, COSMIC database, September 2019). By contrast, with the notable exception of TERT promoter mutations, which induce the expression of telomerase (a key enzyme necessary for unlimited cell proliferation), very few driver alterations have been identified in the non-coding genome. Analyses of mutation hotspots and of known regulatory regions such as promoters and enhancers have failed to identify significantly recurrent mutations with a strong transcriptional impact on cancer genes. The main reason is the difficulty of predicting the functional consequences of non-coding mutations: although these mutations can alter important regulatory regions and modulate the expression of key cancer genes, there is no established method to predict their transcriptional impact. To fill this gap, we will develop a deep neural network able to predict gene expression from the local sequence context. Pioneering studies have demonstrated that deep neural networks can learn to recognize several regulatory features directly from the DNA sequence, including splice sites, chromatin accessibility, 3D conformation and transcription factor binding sites. More recently, Olga Troyanskaya’s team developed a deep neural network able to predict, from the DNA sequence, the expression level of genes in a cell-type-specific manner by integrating predictions of chromatin state and transcription factor binding. Once trained, these neural networks can predict in silico the regulatory impact of any sequence variant, and are thus extremely valuable assets for identifying disease-causing variants.
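The in silico prediction step described above can be illustrated with a minimal Python sketch. It is not the network proposed in this project. `toy_expression_model` is a hypothetical stand-in for a trained sequence-to-expression model, and the motif it counts (GGAA) is an arbitrary assumption chosen only so that the example runs end to end. The part that would carry over to a real model is the scoring scheme: encode the reference and mutated sequences, run both through the same predictor, and take the difference.

```python
# Illustrative sketch only. `toy_expression_model` is a hypothetical
# placeholder for a trained deep neural network, not a real predictor.

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element vectors ordered A, C, G, T."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def toy_expression_model(encoded):
    """Placeholder predictor: counts exact matches of an assumed motif (GGAA).

    A real model would map the encoded sequence to a predicted expression
    level; this function only mimics that interface.
    """
    motif = one_hot("GGAA")
    k = len(motif)
    score = 0.0
    for start in range(len(encoded) - k + 1):
        if encoded[start:start + k] == motif:
            score += 1.0
    return score


def apply_snv(seq, position, alt):
    """Return `seq` with a single-nucleotide substitution at a 0-based position."""
    if alt not in BASES:
        raise ValueError(f"unexpected base: {alt!r}")
    if not 0 <= position < len(seq):
        raise IndexError("position outside the sequence")
    return seq[:position] + alt + seq[position + 1:]


def variant_effect(model, ref_seq, position, alt):
    """Predicted change in expression: model(mutated) minus model(reference)."""
    alt_seq = apply_snv(ref_seq, position, alt)
    return model(one_hot(alt_seq)) - model(one_hot(ref_seq))


# Hypothetical 12-bp reference; the C>A substitution at index 9 creates the motif.
reference = "ACGTCCGGACGT"
print(variant_effect(toy_expression_model, reference, 9, "A"))  # 1.0
```

Because the score is a difference between two predictions on otherwise identical sequences, the same loop could in principle be run over every somatic substitution in a cohort to rank candidate regulatory variants, assuming a trained model is substituted for the placeholder.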
Deep learning has been used to identify causal variants in several diseases, including autism, but has not yet been applied to cancer. Our hypothesis is that leveraging the power of deep neural networks to explore the millions of somatic alterations identified in cancer sequencing projects is a promising approach to uncover the missing driver events in the non-coding human genome.