Machine Learning in clinical biology and medicine: from prediction of multidrug resistant infections in humans to pre-mRNA splicing control in Ciliates

Mancini, Alessio

doi:10.15165/mancini-alessio_phd2022-09-29

Machine Learning methods have begun to permeate the clinical literature, and the correct use of algorithms and tools can facilitate both diagnosis and therapy. The availability of large quantities of high-quality data could lead to an improved understanding of risk factors in community-acquired and healthcare-acquired infections. In the first part of my PhD program, I refined my skills in Machine Learning by developing, and evaluating on a real antibiotic stewardship dataset, a model to predict multidrug-resistant urinary tract infections after patient hospitalization. For this purpose, I created an online platform called DSaaS, specifically designed for healthcare operators to train ML models (supervised learning algorithms). These results are reported in Chapter 2. In the second part of the PhD thesis (Chapter 3), I used my new skills to study genomic variants, in particular the phenomenon of intron splicing. One of the important modes of pre-mRNA post-transcriptional modification is alternative intron splicing, which includes intron retention (unsplicing) and allows the creation of many distinct mature mRNA transcripts from a single gene. An accurate interpretation of genomic variants is the backbone of genomic medicine. Determining, for example, the causative variant in patients with Mendelian disorders facilitates both management and potential downstream treatment of the patient’s condition, as well as providing peace of mind and allowing more effective counselling for the wider family. Recent years have seen a surge in bioinformatics tools designed to predict variant impact on splicing, and these offer an opportunity to circumvent many limitations of RNA-seq based approaches. An increasing number of these tools rely on machine learning approaches that identify patterns in data and use this knowledge to make predictions on new data.
I optimized a pipeline to extract introns from genomes and transcriptomes and to classify them into retained introns (RIs) and constitutively spliced introns (CSIs). I used data from ciliates because of the peculiar organization of their genomes (enriched in coding sequences) and because they are unicellular organisms without cells differentiated into tissues. This made the identification and manipulation of introns easier. In collaboration with my PhD colleague Dr. Leonardo Vito, I analyzed these intronic sequences in order to identify “features” that Machine Learning algorithms can use to predict and classify them. We also developed a platform to manipulate the FASTA, GTF, BED and other files produced by the pipeline tools. I named the platform Biounicam (intron extraction tools); it is available at http://46.23.201.244:1880/ui. The major objective of this study was to develop an accurate machine-learning model that can predict whether an intron will be retained or not, to understand the key features involved in the intron retention (IR) mechanism, and to provide insight into the factors that drive IR. Once the model has been developed, the final step of my PhD work will be to expand the platform with different machine learning algorithms to better predict retention and to test new features that drive this phenomenon. These features will hopefully contribute to finding new mechanisms that control intron splicing. The additional papers and patents I published during my PhD program are listed in Appendices B and C. These works, which range from microbiology to classical statistics, have enriched me with many techniques useful for future work.
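
As an illustration of what turning intronic sequences into “features” can look like, the following minimal Python sketch reads intron sequences in FASTA format and computes a few simple descriptors for each. It is not the pipeline described in the thesis. The feature set (length, GC content, canonical GT/AG boundaries, length divisible by three), the function names and the two example sequences are all assumptions chosen for illustration.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def intron_features(seq):
    """Compute illustrative per-intron features (hypothetical choices)."""
    length = len(seq)
    gc = (seq.count("G") + seq.count("C")) / length if length else 0.0
    return {
        "length": length,
        "gc_content": gc,
        "donor_is_GT": seq[:2] == "GT",      # canonical 5' splice site
        "acceptor_is_AG": seq[-2:] == "AG",  # canonical 3' splice site
        "length_multiple_of_3": length % 3 == 0,
    }


# Made-up sequences, used only to exercise the functions above.
example = StringIO(">intron_1\nGTAAGTTTAATAG\n>intron_2\nGTATGCGCAG\n")
features = {name: intron_features(seq) for name, seq in read_fasta(example)}

for name, feats in features.items():
    print(name, feats)
```

A table of such per-intron features, paired with RI/CSI labels, is the kind of input a supervised classifier would be trained on.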