This describes our attempts at the Machine Learning project for 10601(A). Thanks to my team-mate Mona Ramadan and our professors Dr. Aarti Singh and Dr Gordon for guiding and mentoring us throughout.


Machine leaning techniques are used to study the brain activity from fMRI data. The data under investigation is composed of pre-processed signal recordings of 5903 voxels taken from a single fMRI scan. The subjects are performing a simple task of pressing a button when a moving bar reaches a line, but not to press the button if they get an early or late stop signal. Support vector machines are used to classify the data and principle component logistic regression is used to to predict the missing features. The data was successfully classified with an accuracy of 0.6134, and the missing data was predicted with an MSE of 0.4673.

Problem Setup

The medical experiments subjects are asked to perform a simple task of pressing a button when given a signal, based on their actions three classes (0,1,3) can be identified.

  • (0) No Event.
  • (1) Early Stop: Successful stop to an early stop signal.
  • (3) Correct Go: Correct button press (within 500ms) on a trial with no stop signal.

The goal of the first part of the project was to classify the data into three classes. A set of 501 labeled training examples was provided, where each training observation is composed of a set of 5903 voxels recordings, the x; y; z coordinates of each voxel, the time at which the fMRI scan was recorded and the classification labels. Using the labeled observations, a set of 1001 unlabeled observations was classified. The objective of the second part of the project was to predict missing voxels values. A set of a 1000 observations from partial fMRI images containing voxels readings of 3172 voxels was used to predict the missing 2731 voxels readings. In the final part of the project, semi-supervised learning techniques were used to enhance the prediction accuracy of the classification problem in part-1. Training data from both part-1 and part-2 was used to classify the test data while predicted classes are compared to the recently released true labels.


Part 1 - Supervised Learning

First, the training data was shuffled twice to ensure randomness among the data points. Then, it was divided into a training set (401 examples) and an internal hold-out set (100 examples). The hold-out set was not used in the internal cross validation tests, it was only used to internally test the trained model. 10-folds CV was imposed on all the classification models.

The training data consisted of 5903 features, and hence was quite difficult to visualize. PCA was used to reduce the feature space to only 2 features for the sake of visualization. Upon plotting the data (Figure 1) we could see that the data is linearly in-separable. This helped us to narrow down the search for a classification method to non-linear classifiers. A pipeline that applied random forest, gradient boosting, K-NN, support vector machines and Adaboost was tested in Python [3],[2], where all the parameters for each algorithm were found by grid search CV. However, all reported accuracies on the scoreboard were relatively low, when compared to the accuracies obtained from classifiers implemented in Matlab. Therefore, Matlab was used as the coding environment for part-1.

Kernel SVM with a Gaussian kernel was chosen as the classification method. Matlab’s ‘Classification Learner’ application was used to train the kernel SVM model. The code was manipulated to train the model on a grid of values for the box constraint C, the kernel scale , and the classification rule (one-vs-one and one-vs-all). 10-folds CV accuracy was used to select the model parameters. Only values of C; and classification rule that gave the highest cross validation accuracy on the training data and achieved an accuracy over 0.65 on the internal hold-out set were used on the test data to predict the classes labels. Variations of this approach were also tested. These included the following:

  • Reducing the feature dimension using PCA. Where only features responsible of 90% of the variance in the data were used.
  • Temporal smoothing of the voxels readings. Where a linear model (using linear regression) was used to smooth the average voxels of all subjects in event-1 (scanning time T 351) and another linear model was used to smooth the the average voxels of events-2 (scanning time T 352). Then the same fitted models are used to temporally smooth the testing data features. Figures 2, 3 show the smoothed training and testing data.
  • Adding the scanning time (events) as a feature. Where events-1 times are labeled 0 and events-2 times are labeled 1.
  • Averaging the predictions of the best performing models.

Part 2 - Unsupervised Learning

Both training and testing features from part-1 were combined and used in this part of the project. The complete data-set was composed of 1502 observations and 5903 features. PCA was used to make the predictions. The variance of the components of the complete data-set is shown in Figure 4. The plot in Figure 4 is cropped both along the x-axis and y-axis for better visualization. It can be seen from the figure that components higher than 300 have very low variance.

The process of predicting the missing features using PCA can be summarized as:

  1. The center (mean) of every training feature in the complete date-set is calculated, then the training data is centered to get Xc
  2. PCA is applied to Xc in order to find Vk by keeping the first k principle components of V. Where XTcXc = V2VT , where k is the number of components that minimizes a 10-folds CV MSE.
  3. The centered missing data-set eXc is created by initially using a 1502 X 5903 matrix of all zeros; then the provided data-set is filled in the provided index while subtracting the corresponding centers and leaving the missing values as zeros.
  4. The previously found matrix Vk is used to find the projections of the missing data-set on the principle components, fWk = eXVk.
  5. The projections fWk are then projected back to the full data size, bXc = VTfWk, and finally, the projections are de-centered.
  6. The missing features are found by extracting the features with the missing indices from bXc.

Another approach that was tested was Principle Component Regression (PCR). In PCR, regression is applied to the projections of the data-set onto the principle components [1]. To set up the problem as a PCR problem we need to clarify the terminology as:

  1. The provided index is used to extract the regressors (Xtrain), a set of 1502 observations by 3172 features. The missing indices are used to extract the dependent variables (Ytrain), which are a set of 1502 observations by 2731 features.
  2. The regression assumption is Ytrain = f (Xtrain) + error
  3. The training set is then used to predict 2731 missing features of 1000 observations (Ypredict) provided a set 1000 regressors (Xprovided).
  4. PCA is applied to reduce the dimension of Xtrain to get ~Xk = XtrainVk.
  5. Instead of applying regression techniques to the full set of provided features (which usually result in over-fitting [1]), PCR applies regression to the projection of the regressors onto the principle component ~Xk. The k value is found by CV. A range of regression techniques was used, including linear regression, Ridge regression, kernel ridge regression, Gaussian process regression and logistic regression. Regularized logistic regression was implemented in Python [5]. Regression with several values of C was performed, where C represents the inverse of the regularization strength [6]. 10-folds CV was used to choose M values of C to train M regression models. The final prediction was made by averaging predictions of the M models.

Part 3 - Semi-Supervised Learning

In this part, we first attempt to design a classification model that performs better than the ones we tried in part-1. Our proposal is to use unsupervised learning, namely K-means clustering, to cluster the 5903 voxels into 5 groups, as shown in Figure 5. Then, we smooth the brain activity within each cluster by simply adding the difference between the linear-fit to the median brain activity within each cluster and it?s corresponding median activity. The smoothed features are then use as inputs to the kernel SVM already modeled in part-1. This process is basically a spatial smoothing process done on voxel clusters obtained by K-means clustering.

Second, a multiple layer neural network was designed and tested. Matlab?s ?Neural Network Toolbox? was used to design, train and test the network. A network of 10 hidden neurons was designed as shown in Figure 6. We attempted to apply a self-learning algorithm, where the predictions of the best classifier in part-1 were used as labels for the test data. Both the training and labeled test data were fed to the neural network.

Results and discussion

Part 1 - Supervised Learning

Table 1 summarizes the results of part-1. Although a total of 85 submissions were made on Autolab, the table highlights the most significant submissions.

The first column in Table 1 refers to the Autolab submission version. Table 1 shows our journey with overfitting. After submitting almost 80 submissions, we ended up with the same accuracy we started with. It?s also worth noting that our previous to the last submission (#84) could have substantially increased our final grade. However, we based our judgment on both the CV accuracy and the hold-out accuracy which are the highest for the last submission. Table 1 shows that temporal smoothing with kernel SVM achieves the highest classification accuracy. It makes sense because temporal smoothing reduces features noise. One justification for the presence of noise based on the scanning time is that a subject setting in the fMRI machine for a long time would respond differently to the task. Another justification is that the fMRI machine itself would generate different responses if run for a long time due to heat and saturation conditions. No matter what causes the presence of noise in the features, temporal smoothing reduces it?s effect giving the classifier a better chance to making more accurate predictions. Figure 7 shows the confusion matrix of the classifiers of the last two submissions. It can be seen that the increase in the classification accuracy for the temporal smoothing is due to improving the prediction of class 1.

Part 2 - Unsupervised Learning

Table 2 summarizes the results of part-2. The table highlights the most significant submissions.

Table 2 indicates that although ridge PCR showed a good performance exhibiting a lower MSE during CV, logistic PCR had the lowest MSE and thus the best performance. CV, again, made the choice of the model difficult. Logistic PCR yielded an accuracy of 0.4760 on hidden test data, which suggests that we have successfully avoided over fitting, where MSE values for CV, hold-out, full data and test data are all in the same range. We think the good performance of the logistic PCR is due to the fact that only 15 components are used, which minimized the chances of overfitting.

Part 3 - Semi-Supervised Learning

Spatial smoothing was applied to the training and testing feature clusters and a Gaussian kernel SVM was used to train the classifier. Using C = 50; = 500 and one-vs-one voting, an accuracy of 0.6201 was achieved on the full data-set. Spatially smoothing the features improves the classifier?s performance, however, the achieved accuracy is just comparable to the temporal smoothing case and not higher. The results of using self-training SVM with neural network seemed promising. The self-taught algorithm [4] is an iterative learner such that, in each iteration only labels predicted with highest confidence are fed back to the learner. For our experiment however, only a single iteration was used without excluding any labels, this resulted in an increased accuracy of 0.6274.


In this project, we examined many of the machine learning techniques learned in class. Both supervised and semi-supervised methods have been investigated to classify and to predict data. The bias-variance tradeoff turned out to be the major difficulty. Having limited and small training data makes it hard to decide on a threshold point before overfitting. Although we have used cross validation to prevent falling into the trap of overfitting, we still fooled ourselves in part 1. CV helps narrow the choices of models, however it never guarantees that the model will successfully generalize to independent data set. The main issue with CV is the randomness of it?s folds. Running the same experiments twice will not generate the same CV values. We also have seen that unsupervised learning methods on their own cannot deliver a good model. However, semi-supervised methods seem to strike a balance between supervised and unsupervised models.


[1] Ian T Jolliffe. A note on the use of principal components in regression. Applied Statistics, pages 300-303, 1982.

[2] Eric Jones, Travis Oliphant, and Pearu Peterson. Scipy: Open source scientific tools for python. 2014.

[3] Wes McKinney. Data structures for statistical computing in python. In Proceedings of the 9th, volume 445, pages 51-56, 2010.

[4] Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y Ng. Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th international conference on Machine learning, pages 759-766. ACM, 2007.

[5] S-l D Scikit. Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12:2825-2830, 2011.

[6] Stefan Van Der Walt, S Chris Colbert, and Gael Varoquaux. The numpy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22-30, 2011.

blog comments powered by Disqus


18 December 2015