Project II: Classification methods for small n large p problems in microarray analysis

 

Classification of tissue type from microarray expression data is complicated by the fact that we typically have only a small number of tissue samples (small n) but a large number of gene expression measurements (large p). For instance, in a microarray analysis of tissue samples taken from breast cancer patients, we have a data set of 22 tissue samples (each one labelled either ‘hereditary’ or ‘sporadic’) and an associated set of expression measurements on 3226 genes for each sample. The task is to determine which, if any, of the 3226 genes are important in distinguishing between the two types of cancer, and what form this relationship takes. Because the number of measurements p is much larger than the number of samples n, the problem is considerably harder than a conventional classification task.

 

The principal aim of the project is, first, to investigate how different models (such as neural networks and regression models) cope with this problem by assessing how well they perform on a number of benchmark data sets and, second, to research methods of improving performance on these examples. This work is expected to suggest ways of adjusting existing models, and may lead to the development of bespoke classification tools for this task.
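
As an illustration of how a model might be assessed on a benchmark data set when n is small, the following is a minimal sketch of leave-one-out cross-validation. The nearest-centroid classifier, the synthetic data generator and all function names and parameter values here are assumptions chosen for illustration; they are not methods or data prescribed by the project.

```python
import random


def nearest_centroid_predict(train_X, train_y, x):
    """Assign x to the class whose mean expression profile is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        # Mean of each gene (column) over the samples in this class.
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(X, y):
    """Leave-one-out cross-validation: hold out each sample in turn."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


def make_synthetic(n=22, p=500, n_informative=10, shift=1.5, seed=0):
    """Hypothetical data: n samples, p genes, only a few of them informative."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = []
    for label in y:
        row = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for j in range(n_informative):
            row[j] += shift * label  # shift the informative genes in class 1
        X.append(row)
    return X, y


X, y = make_synthetic()
print("LOOCV accuracy:", loocv_accuracy(X, y))
```

With so few samples, each held-out prediction carries a lot of weight in the estimate, which is one reason why performance comparisons between models in this setting need to be made with care.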

 

In addition, as a slightly different approach to the discovery of influential genes, hypothesis testing methods will be used to study differential expression. Issues such as parametric versus non-parametric analysis, multiple testing, false discovery rates and calibration are currently the subject of much attention, and this project will study the role of such techniques in the gene discovery context, as they may give insight into how good classifiers can be constructed.
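
To make the multiple testing issue concrete, the following is a minimal sketch of the Benjamini-Hochberg step-up procedure, one widely used way of controlling the false discovery rate when one test is carried out per gene. The function name, the level q = 0.05 and the example p-values are assumptions for illustration only; the project does not commit to this particular procedure.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Benjamini-Hochberg step-up rule: sort the m p-values, find the
    largest rank k with p_(k) <= k * q / m, and reject the k smallest.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per gene, from some test of differential expression.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example, q=0.05))
```

In this example only the two smallest p-values are declared significant, even though five of them fall below 0.05 individually. With 3226 genes, testing each gene at the 0.05 level without any correction would be expected to flag around 160 genes by chance alone even if none were truly differentially expressed.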