Project proposal MSc in Computational Genetics and Bioinformatics 2001/2

Project IV: Hidden Markov Models for Genomic Structural Analysis

The problem of identifying genes in genomic DNA sequences by computational methods has attracted considerable research attention in recent years. The problem is closely related to the fundamental biochemical issue of specifying the precise sequence determinants of transcription, translation and splicing. The solution increasingly depends on computer software: programs for tasks such as exon prediction and the location of splice sites are routinely used by genome sequencing laboratories. The accuracy of ab initio prediction is often reasonably high, but is sometimes lower than would be regarded as acceptable elsewhere in machine learning. The key learning methodology is related to hidden Markov models (HMMs), and HMMs have proved to be the basis of the most successful prediction and annotation schemes.

At the heart of the gene detection problem is a classification task. We wish to assign regions of the DNA sequence, point-by-point if necessary, to one of a number of previously identified classes (for example, exon, intron or intergenic region). HMMs achieve this essentially by using computational learning procedures, and often extensive training data, to form the basis of a classifier.
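
As an illustration of point-by-point classification with an HMM, the following is a minimal sketch of Viterbi decoding, which finds the most probable sequence of hidden states for an observed DNA string. It is not the model used by any of the packages discussed in this proposal. The two states ("noncoding" and "coding") and all of the probabilities are hypothetical values chosen for illustration only; in practice they would be estimated from training data, and real gene finders use far richer state structures.

```python
import math

# Hypothetical two-state model: every probability below is illustrative,
# not estimated from real data.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(sequence):
    """Return the most probable state path for a DNA string (log-space Viterbi)."""
    if not sequence:
        return []
    # score[s] holds the log-probability of the best path ending in state s.
    score = {
        s: math.log(START[s]) + math.log(EMIT[s][sequence[0]]) for s in STATES
    }
    backpointers = []
    for base in sequence[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s at this position.
            best_prev = max(
                STATES, key=lambda p: score[p] + math.log(TRANS[p][s])
            )
            pointer[s] = best_prev
            new_score[s] = (
                score[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][base])
            )
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path
```

Calling viterbi("ATATATATATGCGCGCGCGC") returns one label per base, so each position of the sequence is assigned to a class. Working in log space avoids numerical underflow on long sequences, which matters at genomic scale.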

The aim of this project is to study, compare and contrast the probabilistic models, learning processes and computational algorithms that underpin the most popular packages and servers used for gene prediction (HMMgene, GENSCAN, GRAIL etc). The key goal is to identify analytic situations in which the algorithms perform well, but more crucially to discover when, and hopefully why, the algorithms fail, as this usually leads to improvement of the analysis method. Further, HMM-based analysis will be contrasted with pattern detection methods and data-mining procedures that have been implemented very successfully in fields such as computer vision, and with other recently developed classification procedures.