STAT 535

Foundations of Machine Learning
STAT 535 Autumn Quarter 2021

Home

Course Description

Syllabus

Books and other resources

Class mailing list

Assignments

Handouts/Course notes

UW Statistics

Project

PAGE UNDER CONSTRUCTION
Each project will have the same training set as starting point. You will perform regression on this data, to predict the power of each turbine. You will train a predictor of your choice for this task, and do all you can to obtain a low expected prediction error.

You will have to submit a report (no more than 10 pages) about what you did, submit your code (excluding the packages you may have used), and make a 1-2 slide summary. Besides the written project, if class time permits, you may have a short presentation 1-2 minutes followed by 1-2 questions from the instructors and the audience. A few days before the last day of classes, we will provide a test set with hidden labels. You will run your predictor on the test set and submit the results, which I and Zhaoqi will evaluate. In the same class, we'll unveil and compare the results.

Data sets The data is a subset of the US4 Wind Turbines data set, containing about 70,000 wind turbines, with their characteristics. The data sets for training are X_train.pkl, y_train.pkl They contain 50,000 examples of the original dataset with some preprocessing.Specifically, small noise was added to a number of features, and other features have been removed. Each line in the inputs file represents the features of one turbine. Your task is to predict the turbine capacity from the available input features. The loss function is the squared error L_LS.
Note that the dataset is presented in a pickle format to reduce file sizes. To read these files and convert them into txt format (so that they are readable in R), please run the script unpack_files.py which is also posted in the same folder (the package numpy for Python is required for running this script). More instructions will come later. Use these data as you wish to obtain your predictor. You are encouraged to consult the original data web site for more information about wind turbines, or to see plots of data samples. But you are not allowed to download the original data or to try to compare the data sets provided with the original data. Later, we will post an unlabeled test set, with the same format, on which you will test the predictor you obtained.

Methods for prediction You will use the data made available to construct your predictor. You need to register (more instructions later) by Nov 22 end of day. Below is a list of possible predictors. find the list, with short clarifications for each model. No matter what method you choose, you are responsible for knowing how this method works, and for explaining how you chose parameters for training. Demonstrating that you understand how to use a predictor is the most important goal of this project.

List of Predictors

Regression tree: single decision tree with branches predicting labels 0...9

Bagged regression tree/Random Forest: an ensemble of decision trees obtained by either randomizing the construction of the trees, or by resampling the training set

Neural net >=2 layers: a multilayer neural network

K-NN: K-nearest neighbors [you can choose the distance]

SVM - regression SVM with your choice of kernel

Linear Least Squares regression -- you can include higher order terms

Kernel regression: Nadaraya Watson, or higher order (e.g. local linear regression)

For any method, you should explore the data first, and do some preprocessing. In particular, you can derive new features from the existing ones, or you can define a particular type of "distance" in the space of images. In addition, whenever it makes sense, it is highly recommended that you also use the raw features in the same predictor, for comparison. Tell us in the project how the raw features fared compared to the features "engineered" by you.

Software resources You are allowed to download software for this project. For any ML software you use, you must know intimately what the software is doing in the context of your project. You must also demonstrate by your project that you mastered various issues of the process of data analysis/prediction. You will be graded mostly (this will become more precise eventually) on your intellectual contribution to the project and only secondarily on the performance/sophistication of the methods borrowed from others.

Generic machine learning packages

scikit-learn (Python)
Weka (Java)
pmtk3 (matlab)
SVM packagages: SVM-torch, SVM-light, LibSVM
TensorFlow (especially for neural nets, but other methods included) (Python)
more TB posted

Time line

Data available	Nov 19
Choose method	Nov 22
Test set available	Dec 6 noon
Test results due	Dec 8, noon
Award ceremony	Dec 9 lecture
Submit report	Dec 15 midnight

Report outline

Preprocessing, what feature set you used
Predictor(s): complete model description, parametrization
Basic training algorithm(s): what algorithm, what parameters, anything unusual you did. Do not reproduce the algorithms from books or lecture unless you make modifications.
Training strategy. Reproducible description of what you did for training (e.g training set sizes, number epochs, how initialized, did you do CV or model selection)
Experimental results, e.g learning curve(s), training (validation) losses, estimated parameter values if they are interpretable and plottable. Be selectivein what you show! Credit will given for careful analysis or visualization of the results.
Estimate of the average prediction squared error L_LS. Optionally, an interval [Lmin, Lmax] where you believe L will be, and how you estimated these.
Optional: references
Total length: no more than 5 pages of contents, with extra pages containing references or figures, up to no more than 10 pages total.

In writing the report, assume that the readers (=instructor and TA) are very familiar with all the predictors and with machine learning terminology; there is no need to reproduce textbook like defintions (and there would be no space for it). What the reader need to know are the specifics of what you did with these predictors. What parameters you used for learning, what inputs, and if there were any variations from the standard methods. For example, if you use a Random Forests package, although we know what a RF is, assume we don't know what variant of RF the package implements, or what the parameters mean. You need to specify these in your report. How to submit your test set results

for the ~15,000 examples in the test set, you will produce ~15,000 labels with values output by your Predictor. Create the file y.out with the following format:
predicted_mean_square_error
yhat_example_1
yhat_example_2
....
(15001 lines)
E.g. y.out
For the value of predicted_mean_square_error you should input your best guess of how your method will perform on the training set.
We will use a script to download and evaluate your results, so please do not vary from these format or file name, lest your prediction error be distorted.
Go to the Canvas dropbox. Upload the files.
T.B. set upEnter also: the names of the method used by your Predictor.

Marina Meila

Last modified: Sat Nov 30 15:43:52 PST 2013