Machine Learning for Big Data
CSE 547/STAT 548 Winter Quarter 2022


Project

[ Generalities ] [ Data sets ] [ Methods ] [ Software ] [ Time line ] [ Report ] [ Results ]

Generalities

Each project will have the same data set as its starting point. You will perform (approximate) nearest neighbor (NN) search on these data using the methods of your choice. You will do all you can to obtain accurate and fast nearest neighbors in a given database (also referred to as the training set).
A few days before the last day of classes, we will provide a "query set". The query set will consist of m query points; for each of them, you will find the points in the training set within radius r, in increasing order of distance to the query point. You will run your method(s) on the query set and submit the results, which the TAs and I will evaluate against the truth.
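As a point of reference for the query task described above, here is a minimal sketch of an exact (brute-force) radius search. It assumes Euclidean distance and treats points at distance exactly r as inside the radius; neither choice is specified on this page, and the function name `radius_search` is purely illustrative. It uses only the Python standard library, so it is far slower than a vectorized or approximate method, but it can serve as ground truth on small samples.

```python
import math

def radius_search(train, query, r):
    """Return (index, distance) pairs for the training points within
    Euclidean distance r of `query`, sorted by increasing distance.

    Assumptions (not stated in the project description): the metric is
    Euclidean and the radius is inclusive (distance <= r).
    """
    hits = []
    for i, x in enumerate(train):
        d = math.dist(x, query)  # Euclidean distance, Python 3.8+
        if d <= r:
            hits.append((i, d))
    hits.sort(key=lambda pair: pair[1])
    return hits

# Toy data, for illustration only.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
result = radius_search(train, (0.0, 0.0), r=2.0)
print(result)  # [(0, 0.0), (1, 1.0), (2, 2.0)]
```

The cost is one distance computation per training point and per query, which is exactly the work an approximate method tries to avoid.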
You will have to submit a report (no more than 10 pages) about what you did, submit your code (excluding any packages you may have used), and [TBD] make a 1-2 slide summary. Besides the written project, if class time permits, you may give a short presentation of 1-2 minutes, followed by 1-2 questions from the instructors and the audience. In the same lecture, we'll unveil and compare the results.

Data sets

The data is a subset of the word2vec
Note that the dataset is provided in pickle format to reduce file sizes. Use these data as you wish to develop your NN search methods. You are encouraged to consult the original data web site for more information, and to explore the data. However, you are not allowed to download the original data or to try to compare the provided data sets with the original data.
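Since the data comes as a pickle, loading it might look like the sketch below. The file name `toy_data.pkl` and the dictionary layout are hypothetical stand-ins (the actual file name and object structure are not given on this page), so the example writes a small toy object first and then reads it back. As a general precaution, only unpickle files from a source you trust, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

def load_pickle(path):
    """Load and return the Python object stored in a pickle file."""
    with open(path, "rb") as f:  # pickle files must be opened in binary mode
        return pickle.load(f)

# Hypothetical stand-in for the course data; the real layout may differ.
toy = {"words": ["cat", "dog"], "vectors": [[0.1, 0.2], [0.3, 0.4]]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_data.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(toy, f)
    data = load_pickle(path)

# Inspect the type and top-level structure before assuming anything else.
print(type(data).__name__, list(data))
```

Printing the type and the top-level keys (or shape) of the loaded object is a reasonable first step before any analysis.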

Methods for NN search

The goal of this project is to trade accuracy for time in implementing NN search. Hence, choose and optimize an approximate NN method from those presented in this course. You are also allowed to use other approximate NN methods, as well as existing open source implementations of NN search. Your project will be graded partly on the practical success of your code (more about the scoring to follow) and partly on your understanding and reasoning about the task, as explained in your report.
In particular, no matter what method you choose, you are responsible for knowing how this method works, and for explaining how you chose parameters for training. Demonstrating that you understand how to perform NN search is the most important goal of this project.
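To make the accuracy-for-time trade-off concrete, the sketch below uses random-hyperplane hashing, one common approximate NN technique (not necessarily one covered in this course). It only examines the training points that share the query's hash bucket and then measures recall against an exact search. Everything here is an assumption for illustration: the function names, the synthetic Gaussian data, Euclidean distance, and recall as the accuracy measure (the actual scoring is not described on this page).

```python
import math
import random
from collections import defaultdict

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def signature(x, planes):
    """One bit per hyperplane: which side of the plane x falls on."""
    return tuple(sum(p * v for p, v in zip(plane, x)) >= 0.0 for plane in planes)

def build_index(train, planes):
    """Group training indices by hash signature."""
    buckets = defaultdict(list)
    for i, x in enumerate(train):
        buckets[signature(x, planes)].append(i)
    return buckets

def approx_radius_search(train, buckets, planes, query, r):
    """Check only the points in the query's bucket (may miss true neighbors)."""
    candidates = buckets.get(signature(query, planes), [])
    hits = [(i, math.dist(train[i], query)) for i in candidates]
    return sorted((h for h in hits if h[1] <= r), key=lambda h: h[1])

def exact_radius_search(train, query, r):
    """Brute-force baseline over all training points."""
    hits = [(i, math.dist(x, query)) for i, x in enumerate(train)]
    return sorted((h for h in hits if h[1] <= r), key=lambda h: h[1])

def recall(approx, exact):
    """Fraction of the true neighbors that the approximate search returned."""
    if not exact:
        return 1.0
    found = {i for i, _ in approx}
    return sum(1 for i, _ in exact if i in found) / len(exact)

# Synthetic data, for illustration only.
rng = random.Random(1)
train = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(500)]
query = train[0]
exact = exact_radius_search(train, query, r=2.5)
for n_bits in (2, 6, 10):
    planes = make_hyperplanes(8, n_bits)
    buckets = build_index(train, planes)
    approx = approx_radius_search(train, buckets, planes, query, r=2.5)
    n_checked = len(buckets.get(signature(query, planes), []))
    print(n_bits, n_checked, recall(approx, exact))
```

The number of bits is the tuning parameter here: more bits mean smaller buckets and fewer distance computations per query, but a higher chance that a true neighbor lands in a different bucket. Note that hyperplanes through the origin hash by direction, so this scheme is better matched to angular similarity than to Euclidean distance; using several independent hash tables is a standard way to recover recall.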

Software resources

You are allowed to download software for this project. For any ML software you use, you must know what the software is doing in the context of your project. The grade will weigh the quality of your report and your intellectual contribution to the project at 80%, and only secondarily (20%) the performance/sophistication of the methods borrowed from others.

Report outline

Total length: no more than 5 pages of content, with extra pages containing references or figures, up to a maximum of 10 pages total. Do not include code in the report.
In writing the report, assume that the readers (= instructor and TA) are very familiar with all the methods and with CS/machine learning terminology; there is no need to reproduce textbook-like definitions (and there would be no space for them). What the readers need to know are the specifics of what you did with these methods and why: what parameters you used, what inputs (if applicable), and whether there were any variations from the standard methods. For example, if you use a software package, assume we don't know what variant of the "textbook" method the package implements, or what the parameters mean. You need to specify these in your report.

How to submit your query set results

Time line TBD
Data available: Feb 18
First analysis (Hw 5): Feb 24
Query set available: ca. March 7
Results due: ca. March 9, noon
Award ceremony: March 10 lecture
Submit report: March 15, midnight


 


Marina Meila
Last modified: Sat Nov 30 15:43:52 PST 2013