BMI-203, Spring 2002, Final Project:

The Zip file in which this document was contained includes two other files: data-train.txt and data-test.txt. These contain gene expression data from human leukemias (38 training samples and 34 test samples). See Golub et al. (1999) for details if you are curious.

You will use these data files to train and test a neural network for prediction of myeloid versus lymphoblastic (AML vs. ALL) leukemia subtype. The training file contains three class labelings: the first is correct and the other two are randomly assigned. You will explore the behavior of your system using cross-validation on the training set under various conditions, then test your trained systems on the test set. Use fully-connected networks with a single hidden layer for all experiments.

Here is what you need to do:

1) Augment your neural network code to support input-variable preselection based on the t statistic. You should be able to choose the N input variables with the highest t statistic, computed from the class labels in a training set; the network should depend only on these input variables. (A minimal sketch of such a selection function appears after this list.)

2) Generalize your NN code to perform leave-one-out cross-validation, with variable preselection performed on only the data used for training in each cross-validation round. It is not OK to use all data examples for variable preselection, only those that are part of the training portion of each round of cross-validation. (The cross-validation loop is also sketched after the list.)

3) Implement a facility for training (with variable preselection) on a training set and testing on an independent test set. (Again sketched after the list.)

4) Experiment with different preselection set sizes, learning rates, and stopping criteria using cross-validation on the training set. Using the parameter regime that works best under cross-validation, train a network on the training data and test it on the test data.

5) Run the identical parameter regime from (4) using each of the two randomly relabeled training classifications with the training data, and test these two networks on the test data.

6) Try to construct a parameter regime such that your network can learn the two bogus mappings on the training set to greater than 90% correct training classification. Note: this is not cross-validation, just a fit to the training data. Test your best trained network for each randomly labeled training set on the test data.
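
To make the preselection in (1) concrete, here is a minimal sketch in Python/NumPy. It assumes the expression data are held in a samples-by-genes array X with 0/1 class labels y (these names are illustrative, not prescribed), uses the unequal-variance two-sample t statistic, and ranks variables by |t| since a gene may be informative in either direction:

    import numpy as np

    def select_top_t(X, y, n_vars):
        # X: (n_samples, n_genes) expression matrix; y: array of 0/1 class labels.
        X0, X1 = X[y == 0], X[y == 1]
        m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
        v0, v1 = X0.var(axis=0, ddof=1), X1.var(axis=0, ddof=1)
        # Unequal-variance two-sample t; the small epsilon guards against
        # zero-variance genes.
        t = (m0 - m1) / np.sqrt(v0 / len(X0) + v1 / len(X1) + 1e-12)
        # Indices of the n_vars inputs with the largest |t|.
        return np.argsort(-np.abs(t))[:n_vars]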
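
For (2), one way to structure the leave-one-out loop is sketched below. The helpers train_net and predict are placeholders for your own network code; the important point is that select_top_t is called inside the loop, on the training fold only:

    def loocv_accuracy(X, y, n_vars, train_net, predict):
        # train_net(X, y) -> trained network; predict(net, x) -> 0/1 label.
        n = len(y)
        correct = 0
        for i in range(n):
            train = np.arange(n) != i                        # hold out sample i
            cols = select_top_t(X[train], y[train], n_vars)  # preselect per fold
            net = train_net(X[train][:, cols], y[train])
            correct += int(predict(net, X[i, cols]) == y[i])
        return correct / n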
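
The facility in (3) is then a single pass built from the same pieces: preselect on the full training set, train once, and apply the same selected columns to the test set. Again, a sketch under the same assumptions:

    def train_and_test(X_tr, y_tr, X_te, y_te, n_vars, train_net, predict):
        cols = select_top_t(X_tr, y_tr, n_vars)    # preselection uses training data only
        net = train_net(X_tr[:, cols], y_tr)
        preds = np.array([predict(net, x) for x in X_te[:, cols]])
        return np.mean(preds == y_te)              # test-set accuracy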

What to turn in:

1) Your program listing.

2) A description of the parameter exploration you performed under (4). Report your best cross-validation result. Show the exact network output for the blind test using your optimal parameter regime. Is your test performance close to the training cross-validation performance? Why or why not?

3) Your results for (5). Is your test performance better or worse than when using properly labeled data? Why or why not?

4) Your results for (6). What does this tell you about the dangers of overfitting training data?