Project proposal MSc in Computational Genetics and Bioinformatics 2001/2

Project IX: Protein classification via motif counting and analysis.

Proteins are typically classified into families according to functional and sequence similarities. Within a family, however, sequence similarity is far from uniform along the length of the proteins: some regions are clearly conserved, while others display little similarity. The conserved regions are often crucial to the protein’s function, for example the catalytic sites of enzymes. Such conserved regions can be used to probe an uncharacterized sequence for indications of its function.

Describing a protein family by its conserved regions focuses on the family’s characteristic and distinctive sequence features. Databases of such conserved features can be used to classify protein, cDNA and genomic DNA sequences.

This project will concentrate on classification methods derived from marginal and sequential motif counting. Statistical testing will rely on approximate normal theory and on simulation-based methods, such as permutation tests.
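
As a rough illustration of the simulation-based testing mentioned above, the following Python sketch counts overlapping occurrences of a motif in a sequence and estimates a one-sided permutation p-value by shuffling the residues. The function names, the toy sequence and motif, and the choice of residue shuffling as the null model are all assumptions made for illustration; they are not part of the project specification.

```python
import random


def count_motif(sequence, motif):
    """Count occurrences of motif in sequence, allowing overlaps."""
    k = len(motif)
    return sum(
        1 for i in range(len(sequence) - k + 1) if sequence[i:i + k] == motif
    )


def permutation_p_value(sequence, motif, n_permutations=999, seed=0):
    """Estimate how unusual the observed motif count is under shuffling.

    Assumed null model (illustrative only): residue composition is kept
    fixed and residue order is random. Returns the observed count and a
    one-sided p-value for seeing a count at least as large.
    """
    rng = random.Random(seed)
    observed = count_motif(sequence, motif)
    residues = list(sequence)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(residues)
        if count_motif("".join(residues), motif) >= observed:
            at_least_as_large += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return observed, (at_least_as_large + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical toy sequence and motif, not real family data.
    toy_sequence = "MKGAGKSTLAGAGKSTVVGAGKST"
    count, p_value = permutation_p_value(toy_sequence, "GAGKST")
    print(count, p_value)
```

Shuffling preserves amino-acid composition while destroying order, so a small p-value suggests that the motif occurs more often than composition alone would explain. A different null model, or a normal approximation to the count distribution, could be substituted without changing the overall structure.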