C++ code for Gaussian process clustering of multidimensional time series

Before downloading the software I have written, you should read the copyright.

Code Download:

This is the C++ code for all of the clustering methodology we developed during my Wellcome Trust postdoc. To compile the code (instructions below), you will need a recent version of the GNU Scientific Library (GSL) installed. Windows users can download an executable for installing the GSL here.

  • splinecluster.tar.gz (last updated October 2010) - Code for fitting Gaussian process models with spline basis functions to multivariate time series. Compiling produces the executable SplineCluster for performing clustering and then curve fitting for each cluster.

    If you have problems compiling and are running Linux, Digital Unix, Windows, Cygwin under Windows or Mac OS X, you may wish to try creating a directory called splinecluster/ and simply copying in the correct static executable instead:

  • <SplineCluster> (Linux 32-bit).

  • <SplineCluster> (Linux 64-bit).

  • <SplineCluster> (Mac OS Sierra 10.12.3).

  • <SplineCluster.dmg> (Mac OS X v10.7.5).

  • <SplineCluster.exe> (Windows)(OLD version).

  • <SplineCluster.exe> (Cygwin) (OLD version).


    To Compile:


    To unpack the files, enter the command

    tar -xvzf splinecluster.tar.gz

    To compile the C++ code, enter the splinecluster/ directory and type

    make (or for Mac OS X, make -f makefile_macOSX)

    Usage:


    In general, the programs expect the data to be in either the column vector format:

      subject 1 @ time 1
      subject 1 @ time 2
              .
              .
      subject 1 @ time T
      subject 2 @ time 1
              .
              .
    
    or the matrix format:
    
      subject 1 @ time 1   subject 1 @ time 2   . . .   subject 1 @ time T
      subject 2 @ time 1   subject 2 @ time 2   . . .   subject 2 @ time T
              .                    .              .             .
              .                    .              .             .
    
    
    In the latter case, make sure that the shell command contains the argument targetcolumns=1.
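
    As an illustration of how the two layouts relate, the following Python sketch reshapes data held in column vector order into the matrix format. It is not part of the SplineCluster distribution: the function names column_to_matrix and format_matrix are hypothetical, and the sketch assumes numeric values with the same number of time points T for every subject.

```python
def column_to_matrix(values, num_times):
    """Reshape a flat list (subject 1 at times 1..T, then subject 2, ...)
    into one row of T values per subject."""
    if num_times <= 0:
        raise ValueError("num_times must be positive")
    if len(values) % num_times != 0:
        raise ValueError("number of values is not a multiple of num_times")
    return [values[i:i + num_times] for i in range(0, len(values), num_times)]


def format_matrix(rows):
    """Render one subject per line with space-separated values
    (the separator is an assumption of this sketch)."""
    return "\n".join(" ".join(str(v) for v in row) for row in rows)


# Two subjects observed at three time points, in column vector order.
column = [0.1, 0.4, 0.9, 1.2, 1.0, 0.7]
matrix = column_to_matrix(column, num_times=3)
print(format_matrix(matrix))
```

    Data written out in this second layout would then be run with targetcolumns=1 in the shell command.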

    Output Files:

    Suppose the final output clustering had C clusters. Under default settings certain files are generated, with the following filename endings.

    • _clusters_.dat - Each row of this file corresponds to one gene. The first column is the output cluster number for that gene (a number between 1 and C), the remaining columns the expression time series for that gene.
    • _dendro_output_.dat - Gives an ordering of the genes as they should appear in a dendrogram for the hierarchical clustering. The gene numbers correspond to the order in which the gene data were input.
    • _membership_probs_.dat - Each row corresponds to one gene (in the order in which the gene data were input). Each column gives the predictive probability of the gene belonging to each of the C clusters.
    • _mergerdetails_.dat - A file to describe the order in which the agglomerative clustering process happened. In each row, the first column gives the number of clusters in the current model; the next four columns contain representative gene numbers of the two clusters which were merged in the previous agglomeration step (two for each cluster, representing extreme genes in some sense for each cluster). The penultimate column gives the log joint probability of the current model, and the final column gives an (unused) BIC score.
    • _partition_.dat - Row i of this file gives the cluster number to which gene i is assigned.
    • _preds_.dat - Used for making plots. This file contains the lower credible interval, mean and upper credible interval curves for each cluster, evaluated over a grid of points.
    • _residuals_.dat - The residuals of the input data after subtracting the fitted curve values.
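
    The output files described above can be post-processed with a few lines of scripting. The Python sketch below groups the rows of a _clusters_.dat file by cluster number and picks the most probable cluster per gene from a _membership_probs_.dat file. It is only a sketch: the function names are hypothetical, and it assumes whitespace-separated numeric columns and that the probability columns are ordered from cluster 1 to cluster C.

```python
from collections import defaultdict


def parse_clusters(lines):
    """Group expression time series by cluster number.

    Each line holds the cluster number first, then the series values
    (assumed here to be whitespace-separated).
    """
    clusters = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        label = int(float(fields[0]))  # tolerate labels written as "1" or "1.0"
        clusters[label].append([float(x) for x in fields[1:]])
    return dict(clusters)


def most_probable_clusters(lines):
    """Return, for each gene, the 1-based index of the column with the
    highest membership probability (column order is an assumption)."""
    best = []
    for line in lines:
        probs = [float(x) for x in line.split()]
        if probs:
            best.append(probs.index(max(probs)) + 1)
    return best


# Small in-memory examples standing in for the real files.
cluster_lines = ["1 0.5 0.7", "2 1.0 1.2", "1 0.4 0.6"]
prob_lines = ["0.9 0.1", "0.2 0.8", "0.7 0.3"]
print(parse_clusters(cluster_lines))
print(most_probable_clusters(prob_lines))
```

    To read real output, the same functions can be given an open file object in place of the example lists, since iterating over a file yields its lines.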

    Single Experiment Example Data Set and Shell Script:

  • ecoli.tar.gz - An example data set, a subset of which was used for analysis in "Genome expression analysis of Anopheles gambiae: Responses to injury, bacterial challenge, and malaria infection" (Dimopoulos et al., PNAS, 2002) and can be found in supplementary data tables at http://www.pnas.org/cgi/content/full/092274999/DC1. The tar file also contains a shell script for running the code in splinecluster/ and generating output files for viewing using R. The data set contains a measurement of relative gene expression under the first of two experimental conditions for each of 2596 genes, taken at the same 6 unequally spaced time points for every gene: 1hr, 4hrs, 8hrs, 12hrs, 18hrs and 24hrs.


    To Run:

    To unpack the test data set, enter the command

    tar -xvzf ecoli.tar.gz

    Then, after going into the ecoli/ directory, entering

    ./ecoli_shell

    will run a shell script to perform hierarchical clustering on the E. coli data set provided. This script will also produce an output file of cluster images, such as the ones below, in ecoli_preds.pdf. The figures are created using the statistical package R.

    Multiple Experiment Example Data Set and Shell Script:

  • simulated_multiple_experiment_data.tar.gz - A toy simulated data set, where expression levels are recorded across 6 experiments with varying time points. A shell script and R script for running the code and producing output figures are also provided. The extra command datablocksizes= in the shell script tells the code how many time points each experiment series has.
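
    To illustrate the idea behind block sizes, the Python sketch below splits one gene's concatenated time series into per-experiment blocks, given the number of time points in each experiment. It does not reproduce how SplineCluster itself parses datablocksizes=, whose exact syntax is given in the shell script rather than here; the function name split_into_blocks and the example sizes are hypothetical.

```python
def split_into_blocks(series, block_sizes):
    """Split a concatenated series into consecutive blocks,
    one per experiment, with the given numbers of time points."""
    if any(size <= 0 for size in block_sizes):
        raise ValueError("block sizes must be positive")
    if sum(block_sizes) != len(series):
        raise ValueError("block sizes do not add up to the series length")
    blocks = []
    start = 0
    for size in block_sizes:
        blocks.append(series[start:start + size])
        start += size
    return blocks


# One gene measured in two experiments with 3 and 4 time points.
series = [0.2, 0.5, 0.9, 1.1, 1.0, 0.8, 0.6]
for block in split_into_blocks(series, [3, 4]):
    print(block)
```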

    To Run:

    To unpack the test data set, enter the command

    tar -xvzf simulated_multiple_experiment_data.tar.gz

    Then, after going into the simulated_multiple_experiment_data/ directory, entering

    ./simulated_shell

    will run a shell script to perform hierarchical clustering on the simulated multiple experiment data set. This script will also produce an output file of cluster images using the statistical package R.


    Example Output Figures:

    This software is still being updated.
