Project proposal MSc in Computational Genetics and Bioinformatics 2001/2

Project III: Detecting coding regions using frequency decomposition - Testing the consensus codon hypothesis

Spectral analysis of DNA sequences, using transform methods that are easy to implement, has shown that exons exhibit a distinctive spectral profile: a marked peak in the spectrum at frequency 1/3 (a period of three bases), with little other noticeable structure. This phenomenon has been observed for a wide range of genomes, but is not readily detectable in short sequences. If such a spectral pattern were always associated with exons and never present elsewhere, an elementary discrimination or gene prediction tool could be devised (in fact, such a tool is already available in some gene prediction packages).

It has been proposed that this ubiquitous spectral pattern arises in coding regions because of a consensus codon (the codon that formed the particular primitive genomic region) that has been conserved through evolutionary history, whereas in non-coding regions the lack of conservation has corrupted the original consensus codon pattern. If this hypothesis is true, then the spectral peak at 1/3 is not really indicative of any periodic behaviour, but merely reflects a significant non-uniformity in the codon distribution. This is also of interest, as it suggests a possible alternative method for gene prediction.
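To make the spectral peak concrete, the following is a minimal sketch of one possible transform. The proposal does not name a specific method, so the choice here (a discrete Fourier transform of four 0/1 base-indicator sequences, with the squared magnitudes summed) is an assumption, as are the function names `indicator_power` and `spectrum`.

```python
import cmath

def indicator_power(seq, freq):
    """Spectral power at `freq` (cycles per base), summed over the four
    0/1 indicator sequences of A, C, G and T, normalised by length."""
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        # Fourier coefficient of the indicator sequence for this base
        coeff = sum(cmath.exp(-2j * cmath.pi * freq * k)
                    for k, b in enumerate(seq) if b == base)
        total += abs(coeff) ** 2
    return total / len(seq)

def spectrum(seq):
    """(frequency, power) pairs at the nonzero DFT frequencies up to 1/2."""
    n = len(seq)
    return [(j / n, indicator_power(seq, j / n)) for j in range(1, n // 2 + 1)]

# Toy input (not real data): a perfectly repeated codon, the extreme
# case of a conserved consensus codon.
toy = "ATG" * 30
peak_freq, peak_power = max(spectrum(toy), key=lambda fp: fp[1])
```

On this toy input the largest power falls at frequency 1/3. Real exons would be expected to show a much weaker peak over a noisy background, which is why short sequences are hard to classify this way.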

This project will investigate the spectral phenomenon and the consensus codon hypothesis, with the goal of producing an effective gene prediction mechanism. It will involve extracting test sequences from genome databases, implementing the spectral analysis methods described above, investigating other related analysis methods, and testing the consensus codon hypothesis.
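One way the non-uniformity of the codon distribution might be quantified is sketched below. This is an illustrative assumption rather than the method the project prescribes: it counts codons in a chosen reading frame and computes a Pearson chi-square statistic against a uniform distribution over the 64 codons. The names `codon_counts` and `chi_square_uniform` are hypothetical.

```python
from collections import Counter

def codon_counts(seq, frame=0):
    """Count non-overlapping codons starting at offset `frame` (0, 1 or 2).
    Codons containing characters other than A, C, G, T are skipped."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return Counter(c for c in codons if set(c) <= set("ACGT"))

def chi_square_uniform(counts, n_categories=64):
    """Pearson chi-square statistic of `counts` against equal expected
    counts in each of `n_categories` codons (unseen codons count as 0)."""
    total = sum(counts.values())
    expected = total / n_categories
    seen = sum((c - expected) ** 2 for c in counts.values())
    unseen = (n_categories - len(counts)) * expected ** 2
    return (seen + unseen) / expected

# Toy input (not real data)
counts = codon_counts("ATGATGCCCATG")
most_common_codon, _ = counts.most_common(1)[0]
stat = chi_square_uniform(counts)
```

A uniform reference is the simplest choice; a reference built from the base composition of the sequence would be a stricter comparison, since unequal base frequencies alone make the codon distribution non-uniform.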

Classical testing procedures (based on and calibrated by re-sampling methods) will be used.
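A re-sampling calibration of this kind could look like the following sketch. It assumes a permutation scheme in which the bases of the sequence are shuffled, which preserves base composition while destroying any codon structure; the proposal does not specify the scheme, and the names `permutation_p_value` and `top_codon_count` are hypothetical.

```python
import random

def permutation_p_value(seq, statistic, n_resamples=199, seed=0):
    """One-sided p-value for `statistic(seq)` under random shuffling
    of the bases of `seq`. Larger statistic values count as more extreme."""
    rng = random.Random(seed)
    observed = statistic(seq)
    bases = list(seq)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(bases)  # in-place permutation of the bases
        if statistic("".join(bases)) >= observed:
            hits += 1
    # Count the observed sequence itself so the estimate is never zero
    return (hits + 1) / (n_resamples + 1)

def top_codon_count(seq):
    """Example statistic: occurrences of the most frequent frame-0 codon."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return max(codons.count(c) for c in set(codons))

# Toy input (not real data): a perfectly repeated codon
p = permutation_p_value("ATG" * 20, top_codon_count)
```

Any scalar statistic can be passed in place of `top_codon_count`, for example the spectral power at frequency 1/3 or a chi-square measure of codon non-uniformity.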