Project proposal MSc in Computational Genetics and Bioinformatics 2001/2

Project I: Clustering Methods For Gene Expression Analysis

Clustering methods are concerned with finding group structure in data, typically by examining the distances between data points. Sometimes, this group structure reveals interesting or informative relationships among the observations.

In bioinformatics, clustering methods have been successfully used to examine gene expression data, often derived from microarray experiments, in order to suggest families of genes, deduce genetic pathways and postulate function for unclassified genes. In another experimental context, clustering is also used to explore relationships between proteins on the basis of mass spectrograms.

Routine application of clustering methods is not necessarily straightforward, not least because the methods are essentially based on distances between data points and hence are sensitive (that is, not robust) to the choice of distance metric. Furthermore, different algorithms construct the groups in different ways and can therefore lead to substantively different conclusions. An analysis of robustness to these choices is standard (and required) in most applications of clustering in statistical analysis.
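
As a minimal sketch of this sensitivity to the distance metric, the Python fragment below finds the nearest neighbour of one expression profile under two different distances. The profiles, the gene names and the helper functions are all hypothetical and invented purely for illustration; they are not data or methods from this project. Euclidean distance is compared with a correlation-based distance (one minus the Pearson correlation), which responds to the shape of a profile rather than its magnitude.

```python
import math

# Hypothetical toy expression profiles (invented values, not real data).
# gene_b has the same shape as gene_a at ten times the magnitude;
# gene_c has a similar magnitude to gene_a but the opposite trend.
profiles = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [10.0, 20.0, 30.0, 40.0],
    "gene_c": [4.0, 3.0, 2.0, 1.0],
}


def euclidean(x, y):
    """Ordinary Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def correlation_distance(x, y):
    """One minus the Pearson correlation of two non-constant profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)


def nearest(name, data, dist):
    """Return the name of the profile closest to `name` under `dist`."""
    others = [k for k in data if k != name]
    return min(others, key=lambda k: dist(data[name], data[k]))


nearest_euclid = nearest("gene_a", profiles, euclidean)
nearest_corr = nearest("gene_a", profiles, correlation_distance)

print("Nearest to gene_a (Euclidean):  ", nearest_euclid)
print("Nearest to gene_a (correlation):", nearest_corr)
```

In this toy case the two distances disagree about which profile is closest to gene_a, so any clustering built on them would begin by grouping different genes together.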

The purpose of this project is to explore the robustness of the conclusions arising from clustering bioinformatics data. Emphasis will be placed on the statistical, visualization and computational aspects of clustering.