CSE 547/STAT 548

Machine Learning for Big Data
CSE 547/STAT 548 Winter Quarter 2022

Project

Generalities Each project will have the same data set as its starting point. You will perform (approximate) nearest neighbor (NN) search on these data with the methods of your choice. You will do all you can to obtain accurate and fast nearest neighbors in a given database (also referred to as the training set).
A few days before the last day of classes, we will provide a "query set". The query set will consist of m query points; for each of them, you will find the points in the training set within radius r, in increasing order of distance to the query point. You will run your method(s) on the query set and submit the results, which the TAs and I will evaluate against the truth.
You will have to submit a report (no more than 10 pages) about what you did, submit your code (excluding any packages you may have used), and [TBD] make a 1-2 slide summary. Besides the written project, if class time permits, you may give a short presentation (1-2 minutes) followed by 1-2 questions from the instructors and the audience. In the same lecture, we will unveil and compare the results.
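
As a point of reference for the query task described above, here is a minimal sketch of exact radius search by linear scan, which can serve as a ground truth when checking an approximate method on a small sample. It assumes Euclidean distance and points stored as plain tuples; neither is specified on this page, and the function name radius_search_exact is purely illustrative.

```python
import math


def radius_search_exact(train, query, r):
    """Exact radius search by linear scan.

    Returns (index, distance) pairs for every training point within
    distance r of `query`, sorted by increasing distance.
    """
    hits = []
    for i, point in enumerate(train):
        # Euclidean distance is an assumption; swap in the metric
        # that the project actually uses.
        d = math.dist(point, query)
        if d <= r:
            hits.append((i, d))
    hits.sort(key=lambda pair: pair[1])
    return hits


# Toy data, for illustration only.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
result = radius_search_exact(train, (0.0, 0.0), r=2.0)
```

The scan costs time proportional to the size of the training set for every query, which is the cost an approximate method tries to reduce.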

Data sets The data is a subset of the word2vec
Note that the dataset is provided in pickle format to reduce file size. Use these data as you wish to develop your NN search methods. You are encouraged to consult the original data website for more information, and to explore the data. However, you are not allowed to download the original data or to try to compare the provided data sets with the original data.
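
A minimal sketch of reading a pickle file in Python follows. The real file name and the structure of the stored object are not stated on this page, so the sketch writes a small stand-in file first; demo_vectors.pkl and its contents are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load one pickled object from `path`."""
    # Only unpickle files from a trusted source: unpickling can
    # execute arbitrary code.
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in file so the sketch runs on its own. The name and the
# word -> vector layout are assumptions, not the course's format.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_vectors.pkl"
    with open(demo_path, "wb") as f:
        pickle.dump({"apple": [0.1, 0.2], "pear": [0.3, 0.4]}, f)
    data = load_pickle(demo_path)

# Inspect the type and size before assuming any particular layout.
print(type(data).__name__, len(data))
```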

Methods for NN search The goal of this project is trading accuracy for time in implementing NN search. Hence, choose and optimize an approximate NN method, from those presented in this course. You are allowed to use other approximate NN methods, as well as to use existing open source implementations of NN search. Your project will be graded partly on the practical success of your code (more about the scoring to follow) and on your understanding and reasoning about the task, as explained in your report.
In particular, no matter what method you choose, you are responsible for knowing how this method works, and for explaining how you chose parameters for training. Demonstrating that you understand how to perform NN search is the most important goal of this project.
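
To make the accuracy/time trade-off concrete, here is a minimal sketch of one approximate method, random-hyperplane (sign) LSH with a single hash table. It is an illustration only, not the variant taught in the course or a recommended choice; the class name HyperplaneLSH and the parameters n_bits and seed are hypothetical, and Euclidean distance is assumed for the final filtering step.

```python
import math
import random
from collections import defaultdict


class HyperplaneLSH:
    """One hash table of random-hyperplane (sign) LSH."""

    def __init__(self, dim, n_bits, seed=0):
        rng = random.Random(seed)
        # Each hyperplane contributes one bit of the bucket key.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(n_bits)]
        self.buckets = defaultdict(list)
        self.points = []

    def _key(self, x):
        # Bit j records which side of hyperplane j the point lies on.
        return tuple(sum(w * v for w, v in zip(plane, x)) >= 0.0
                     for plane in self.planes)

    def add(self, x):
        self.buckets[self._key(x)].append(len(self.points))
        self.points.append(x)

    def query(self, q, r):
        """Points in q's bucket within distance r, nearest first."""
        candidates = self.buckets.get(self._key(q), [])
        hits = [(i, math.dist(self.points[i], q)) for i in candidates]
        # Exact filtering removes false positives; neighbors that fell
        # into other buckets are missed (false negatives).
        hits = [(i, d) for i, d in hits if d <= r]
        hits.sort(key=lambda pair: pair[1])
        return hits


# Toy data, for illustration only.
points = [(1.0, 0.0), (0.9, 0.1), (-1.0, 0.0), (0.0, 1.0)]
index = HyperplaneLSH(dim=2, n_bits=3, seed=0)
for p in points:
    index.add(p)
hits = index.query((1.0, 0.0), r=0.5)
```

In this sketch, increasing n_bits makes buckets smaller, so queries scan fewer candidates but miss more true neighbors; using several independent tables and merging their candidates is the usual way to recover some of those misses at the cost of time and memory.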

Software resources You are allowed to download software for this project. For any ML software you use, you must know what the software is doing in the context of your project. The grade will be based primarily (80%) on the quality of your report and your intellectual contribution to the project, and only secondarily (20%) on the performance/sophistication of the methods borrowed from others.

Generic machine learning packages (to be updated)

Report outline

Data exploration. What data properties influenced your choice of NN method?

Method(s): Describe the methods you used, in enough detail to make the work reproducible. Assume that the readers know the methods already (e.g. you do not need to describe the LSH method) but do not know what variant of the method you used, or with what parameters. In other words, your report need not reproduce the course materials; it needs to complement them with the specifics of what you did. For full credit, please define any parameters that are not defined in the course materials.

Strategy. Reproducible description of how you traded off time, accuracy, effort on your part, etc. What was your reasoning? If you tried multiple approaches, what worked and what failed? Be explicit in describing the kind of failure (or success) you observed. Note: many of you encountered problems with Colab. Mention these, since they influenced your choices and your results; but remember they are not the focus of the project.

Experimental results. Some of these should reflect/intersect with your training strategy. This part is your analytical work to validate the methods you developed. The results should go beyond running your method on the given test set. For example, describe what experiments you did to optimize your method. How did the results influence your final algorithm? Extract features of your methods or data that are interpretable and plottable. Be selective in what you show! Credit will be given for careful analysis or visualization of the results.

Estimate of the average retrieval metrics: time, false positive rate and false negative rate. Optionally, prediction intervals within which you believe your metrics will fall, and how you estimated these.

Optional: references

Total length: no more than 5 pages of content, with extra pages containing references or figures, up to no more than 10 pages total. Do not include code in the report.
In writing the report, assume that the readers (= instructor and TA) are very familiar with all the methods and with CS/machine learning terminology; there is no need to reproduce textbook-like definitions (and there would be no space for them). What the readers need to know are the specifics of what you did with these methods and why: what parameters you used, what inputs (if applicable), and whether there were any variations from the standard methods. For example, if you use a software package, assume we don't know what variant of the "textbook" method the package implements, or what the parameters mean. You need to specify these in your report.

How to submit your query set results

Instructions to be posted on Canvas

We will use a script to download and evaluate your results, so please do not vary from the format and file name, lest your results be distorted.

Scoring of your results is described on Canvas (see Project module). It takes into account the rates of false positives, false negatives and time elapsed.
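
For your own experiments before the query set is released, the three quantities named above can be estimated on held-out queries with known exact answers. The sketch below is not the official scoring: the rate definitions (false positives as a fraction of returned points, false negatives as a fraction of true neighbors) are assumptions, and the names retrieval_rates and evaluate are hypothetical. The official definitions are the ones on Canvas.

```python
import time


def retrieval_rates(returned, truth):
    """Per-query (false positive rate, false negative rate).

    Assumed definitions, not the official ones:
      FP rate = |returned - truth| / |returned|
      FN rate = |truth - returned| / |truth|
    """
    returned, truth = set(returned), set(truth)
    fp = len(returned - truth) / len(returned) if returned else 0.0
    fn = len(truth - returned) / len(truth) if truth else 0.0
    return fp, fn


def evaluate(search_fn, queries, truths):
    """Average FP rate, FN rate and seconds per query."""
    fps, fns, times = [], [], []
    for q, truth in zip(queries, truths):
        start = time.perf_counter()
        returned = search_fn(q)
        times.append(time.perf_counter() - start)
        fp, fn = retrieval_rates(returned, truth)
        fps.append(fp)
        fns.append(fn)
    n = len(times)
    return sum(fps) / n, sum(fns) / n, sum(times) / n


# Toy example: a lookup table stands in for an approximate search.
toy_answers = {"q1": [1, 2, 3, 9], "q2": [4]}
toy_truths = [[1, 2, 3, 4, 5], [4]]
avg_fp, avg_fn, avg_time = evaluate(
    lambda q: toy_answers[q], ["q1", "q2"], toy_truths
)
```

Averages over a handful of queries are noisy; repeating the evaluation over several random query samples gives a rough sense of the spread.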

Time line TBD

Data available Feb 18

First analysis (Hw 5) Feb 24

Query set available ca. March 7
Results due ca. March 9, noon
Award ceremony March 10 lecture
Submit report March 15 midnight

Last modified: Sat Nov 30 15:43:52 PST 2013