Each project will start from the same training set. You
will perform 10-class classification on this data.
You will train a predictor of your choice for this task, and do all you can to obtain a low expected classification error.
You will have to submit a report (approximately 10 pages) about what
you did, submit your code (excluding any packages you may have used),
and make a 1-2 slide summary. Besides the written project, you will give a short presentation (1-2 minutes) followed by 1-2 questions from the instructors and the audience. A few days before the last day of classes, we will
provide a test set with hidden labels. You will run your predictor on
the test set and submit the results, which Zhaoqi and I will
evaluate. In the same class, we'll unveil and compare the results.
Data sets The data is a subset of CIFAR-10 of size 40,000, with examples divided approximately equally into the 10 classes and some preprocessing applied.
The training data is available on Canvas in the "Files Project" folder. Each line represents a 32*32 image with 3 RGB channels. Note that the data is stored in pickle format to reduce file sizes. To read these files and convert them into txt format (so that they are readable in R), run the script unpack_files.py, which is posted in the same folder (the numpy package for Python is required to run this script). More instructions will come later.
Use these data as you wish to obtain your predictor.
Later, we will post an unlabeled test set, with the same format, on which you will test the
predictor you obtained.
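As a starting point, a minimal sketch of reading one of these pickle files in Python and reshaping the flat rows into images. This assumes the standard CIFAR-10 batch layout (a dict with data and label entries, each row being 1024 R, then 1024 G, then 1024 B values); the exact keys and file names on Canvas may differ, so adapt as needed.

```python
import pickle
import numpy as np

def load_batch(path):
    """Load a CIFAR-10-style pickle batch (keys/layout are an assumption;
    check against the actual files posted on Canvas)."""
    with open(path, "rb") as f:
        batch = pickle.load(f, encoding="bytes")
    # each row is a flat 3072-vector: 1024 R, then 1024 G, then 1024 B
    X = np.asarray(batch[b"data"], dtype=np.uint8)
    y = np.asarray(batch[b"labels"])
    # reshape to (n, 32, 32, 3) images for visualization or feature engineering
    images = X.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    return X, y, images
```

The (n, 32, 32, 3) view is convenient for plotting and for deriving spatial features, while the flat (n, 3072) matrix feeds directly into most classifiers.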
Methods for classification You will use the data
made available to construct your predictor. You need to register (more instructions later) by Nov 19.
Below you will find a list of possible predictors, with short clarifications for each model. No matter what method you choose, you are responsible for knowing how this method works, and for explaining how you chose parameters for training. Demonstrating that you understand how to use a predictor is the most important goal of this project.
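One common way to choose and justify training parameters is cross-validation. The sketch below uses scikit-learn's GridSearchCV with K-NN on synthetic stand-in data (the features, labels, and parameter grid are illustrative assumptions, not a recommendation for this dataset).

```python
# Hedged sketch: selecting a hyperparameter by cross-validation.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))      # stand-in for image features
y = rng.integers(0, 10, size=200)   # 10-class labels

# 3-fold cross-validation over a small grid of neighbor counts
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5]},
    cv=3,
)
search.fit(X, y)
best_k = search.best_params_["n_neighbors"]
```

Reporting the full cross-validation table (not just the winner) is a good way to demonstrate in the report how you chose the parameters.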
List of Predictors
- Decision tree: a single decision tree whose leaves predict labels 0...9
- Bagged decision tree: an ensemble of decision trees obtained by either randomizing the construction of the trees, or by resampling the training set
- Neural net >=2 layers: a multilayer neural network, with 10 outputs, [optionally softmax on the outputs]
- [Boosting: a boosted multiclass weak classifier of your choice not included this Fall ]
- K-NN: K-nearest neighbors [you can choose the distance]
- Naive Bayes [you can choose the features]
- SVM - RBF kernel: multiclass SVM with the Gaussian kernel
- SVM - other: multiclass SVM with a different kernel
- Logistic regression: one logistic regression per class, [optional softmax output]
- Generative model: train a P(X|class) model separately for each class, predict the class label by Bayes' rule
- ECOC: Error correcting output codes. Train a set of binary classifiers (any choice) and obtain the class label from their combination.
- Multiclass from binary: Train a set of binary classifiers as above and obtain the class label from their combination, by a method other than ECOC.
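To illustrate the ECOC entry above, here is a minimal sketch using scikit-learn's OutputCodeClassifier with logistic regression as the binary base learner (the data and parameters are synthetic placeholders; any binary classifier from the list could be substituted).

```python
# Hedged sketch: error-correcting output codes over binary classifiers.
import numpy as np
from sklearn.multiclass import OutputCodeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))      # stand-in features
y = rng.integers(0, 10, size=300)   # 10-class labels

# code_size=2 assigns each class a binary code of length 2 * n_classes;
# one binary classifier is trained per code bit
ecoc = OutputCodeClassifier(
    LogisticRegression(max_iter=500),
    code_size=2,
    random_state=0,
)
ecoc.fit(X, y)
preds = ecoc.predict(X)
```

Prediction picks the class whose code is closest to the vector of binary outputs, which is what gives the scheme its error-correcting behavior.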
For any method, you should explore the data first and do some
preprocessing. In particular, you can derive new features from the
existing ones, or you can define a particular type of "distance" in
the space of images. In addition, whenever it makes sense, it is
highly recommended that you also run the same classifier on the raw
features, for comparison. Tell us in the report how the raw features fared compared to the features you engineered.
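A toy illustration of such feature engineering: collapsing each flat 3072-pixel row into per-channel means (this particular feature is only an example, assuming the standard R-then-G-then-B flat layout; it is not a recommendation).

```python
# Hedged sketch: deriving simple engineered features from raw pixels.
import numpy as np

def channel_means(X_flat):
    """X_flat: (n, 3072) rows laid out as 1024 R, 1024 G, 1024 B values
    (an assumption about the flat layout; verify against the data)."""
    channels = X_flat.reshape(-1, 3, 1024)
    return channels.mean(axis=2)    # (n, 3) engineered features

X_raw = np.random.default_rng(0).integers(0, 256, size=(5, 3072))
X_eng = channel_means(X_raw)        # feed X_raw and X_eng to the same classifier
```

Training the same classifier once on X_raw and once on X_eng gives exactly the raw-versus-engineered comparison requested above.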
Software resources You are allowed to download
software for this project. In this case, you must know intimately what
the software is doing in the context of your project. You must also
demonstrate in your project that you have mastered the various issues of the
data analysis/prediction process. You will be graded mostly (this
will become more precise eventually) on your intellectual contribution
to the project and only secondarily on the performance/sophistication of the
methods borrowed from others.
Generic machine learning packages
- scikit-learn (Python)
- Weka (Java)
- pmtk3 (matlab)
- SVM packages: SVM-torch, SVM-light, LibSVM
- TensorFlow (especially for neural nets, but other methods included) (Python)
- more to be posted
Time line