Each project will have the same data set as starting point. You will perform non-parametric clustering on these data. You will use an assigned method as well as a second method of your choice for this task, and do all you can to obtain a good clustering.
There is no single correct answer for these data. You will have to describe what you aimed for/what you think you found by the methods you used.
You will have to submit a report (approximately 10 pages) describing what you did, submit your code (excluding any packages you may have used)[, and OPTIONALLY (TBD) make a 1-2 slide summary. Besides the written project, you will give a short presentation (1-2 minutes), followed by 1-2 questions from the instructors and the audience.] In week 9 we will open a site where you will enter your clusterings, which Jess and I will evaluate. In the last class, we will unveil and compare the results.
Data set The data set is available on canvas under "Files Project" folder. It has n=12,000 data points in d=64 dimensions.
Methods for clustering
You need to use two different algorithms to cluster your data. You need to
register one of the algorithms (more instructions later) by May 10. Below is a list of possible algorithms. No matter what method you choose, you are responsible for knowing how this method works, and for explaining how you chose parameters for clustering. Demonstrating that you understand how to use the clustering methods is the most important goal of this project.
List of NP Clustering algorithms for Method 1
- Mean-shift (not Gaussian Blurring Mean Shift) max 8 students
- Level set method with kernel density estimator -- only one-level clustering (no cluster tree); you choose the level and decide what labels to give the points left unclustered (e.g., assign them to the nearest cluster, or make each a singleton cluster) -- max 8 students
- Level set + K-nearest neighbor graphs max 4 students
- DBSCAN
- Random Forests
- Dirichlet Process Mixtures -- you must understand the algorithm enough to explain how you use it and to interpret the results
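To give a sense of what running two of the listed algorithms looks like, here is a minimal sketch using scikit-learn's Mean-shift and DBSCAN. The synthetic data below is only a stand-in for the project data (which you would load yourself); every parameter value shown is a placeholder you are expected to tune and justify.

```python
import numpy as np
from sklearn.cluster import MeanShift, DBSCAN, estimate_bandwidth
from sklearn.datasets import make_blobs

# Stand-in data; the real set has n=12,000 points in d=64 dimensions,
# loadable with e.g. np.loadtxt(...) depending on the file format.
X, _ = make_blobs(n_samples=1200, n_features=8, centers=4, random_state=0)

# Mean-shift: the bandwidth is the critical parameter. estimate_bandwidth
# only gives a starting point; you should vary it and justify your choice.
bw = estimate_bandwidth(X, quantile=0.2, n_samples=500, random_state=0)
ms_labels = MeanShift(bandwidth=bw, bin_seeding=True).fit_predict(X)

# DBSCAN: eps and min_samples set the density threshold; label -1 = noise.
db_labels = DBSCAN(eps=3.0, min_samples=10).fit_predict(X)
```

Note that how you handle DBSCAN's noise points (label -1) is exactly the kind of decision the report should discuss.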
Method 2 can be any method, either from this list or elsewhere. The reason there are two methods is that Method 1 ensures some diversity in the pool of methods, so that we have more methods to compare, while Method 2 leaves you free to explore whatever method you see fit for the data, or are curious about. Extra credit (~2%) will be given for working with methods not covered in class. Method 2 may be from the same class of methods as Method 1; in that case, the two methods must be run with different parameters or different data transformations. In addition, to get full credit, you must explain how you chose the two sets of parameters/transformations (i.e., if the first set of parameters is best/good in some way, then the second set must be best/good in some different way, and you must make a credible case for it; it cannot be just some arbitrary other parameter setting).
Grading of results will take into account only the "best" of your 2 clusterings. Hence, if you had to choose a suboptimal Method 1, you can be free with Method 2. (That means that if you get a particularly bad clustering by one method, it will not count against you).
For any method, you should explore the data first. You can do some preprocessing. In particular, you can derive new features from the existing ones, or you can define a particular type of "distance". In addition, whenever it makes sense, it is highly recommended that you also use the raw features with the same clustering method, for comparison. Tell us in the project how the raw features fared compared to the features "engineered" by you.
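The raw-vs-engineered comparison suggested above can be sketched as follows: run the same clustering method once on the raw features and once on a transformed feature set. Standardization followed by PCA is just one possible choice of "engineering"; the data and all parameter values here are placeholders.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

# Stand-in for the real 12,000 x 64 data matrix.
X, _ = make_blobs(n_samples=1000, n_features=16, centers=5, random_state=1)

# Same method, raw features.
labels_raw = DBSCAN(eps=4.0, min_samples=10).fit_predict(X)

# Same method, "engineered" features: standardize, then reduce with PCA.
X_eng = PCA(n_components=8).fit_transform(StandardScaler().fit_transform(X))
labels_eng = DBSCAN(eps=1.0, min_samples=10).fit_predict(X_eng)

# Compare how many clusters (excluding noise, label -1) each version found.
n_raw = len(set(labels_raw) - {-1})
n_eng = len(set(labels_eng) - {-1})
```

Note that eps is scale-dependent, so it must be re-tuned after any transformation; reporting that re-tuning is part of the comparison.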
Software resources You are allowed to download software for this project. In this case, you must know intimately what the software is doing in the context of your project. You must also demonstrate by your project that you mastered various issues of the process of data analysis and clustering. You will be graded mostly (this will become more precise eventually) on your intellectual contribution to the project and only secondarily on the performance/sophistication of the methods borrowed from others.
Generic machine learning packages
- scikit-learn (Python)
- Weka (Java)
- pmtk3 (matlab)
- SVM packages: SVM-torch, SVM-light, LibSVM
- TensorFlow (especially for neural nets, but other methods included) (Python)
- more T.B. posted
Time line
Data available        | May 2
Choose Method 1       | May 10
Outputs due           | May 23, noon
Unveiling the results | May 31 lecture
Submit report         | June 5, midnight
Report outline
- Preprocessing, what feature set you used
- Methods. What methods, why (in case of method 2) with some details (e.g. kernel type)
- Clustering strategy. Reproducible description of what you did for clustering. How did you set parameters, how did you set algorithm's parameters (e.g. convergence), with supporting evidence (e.g. plotted density and found it too rough -- this would work for data in 1 dimension).
- Evaluation: how good is the clustering, and how did you evaluate this? Here we expect more than one evaluation method, provided they are interpretable and plottable. Be selective in what you show! Credit will be given for careful analysis or visualization of the results.
- Bibliographic references as needed (the class notes need not be referenced, only other material you read)
- Total length: no more than 5 pages of text (not counting references and figures) and no more than 10 pages total. [The 5 pages are approximate. Do not try to separate text from figures. In fact, we encourage you to place figures in the text at the most natural location; this makes reading easier.]
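The "supporting evidence" asked for in the clustering-strategy bullet can often be generated programmatically. As one illustration (not the required procedure), for DBSCAN the sorted k-th nearest-neighbor distances suggest a value for eps (look for the knee of the curve), and an internal index such as the silhouette score gives one of several possible quality checks. The data and the percentile rule below are placeholders.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Stand-in data for illustration.
X, _ = make_blobs(n_samples=800, n_features=8, centers=4, random_state=2)

k = 10
dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
kdist = np.sort(dists[:, -1])          # plot this curve; its knee ~ eps

eps = float(np.percentile(kdist, 90))  # crude stand-in for reading the knee
labels = DBSCAN(eps=eps, min_samples=k).fit_predict(X)

core = labels != -1                    # silhouette is undefined for noise
score = silhouette_score(X[core], labels[core])
```

A plot of the kdist curve, with the chosen eps marked, is exactly the kind of figure that supports a parameter choice in the report.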
In writing the report, assume that the readers (= instructor and TA) are very familiar with all the clustering methods presented in class and with machine learning terminology; there is no need to reproduce textbook-like definitions (and there would be no space for it). What the readers need to know are the specifics of what you did with these methods: what parameters you used for learning, what inputs, and whether there were any variations from the standard methods. For example, if you use a Random Forests package, although we know what a RF is, assume we don't know what variant of RF the package implements, or what the parameters mean. You need to specify these in your report.
How to submit your results
- For the 12,000 examples in the data, you will produce 12,000 cluster labels with each method. Create the files alg1_out.txt and alg2_out.txt with the following format:
label1
label2
....
label
....
(12,000 lines)
E.g. example-output.txt
The label is an integer, representing the cluster label; the ordering of the labels must match the ordering of the points in the data set.
The labels can be any non-negative integer, as long as points in the same clusters have the same label.
We will use a script to download and evaluate your results, so please do not deviate from the format or file names, or your results may be evaluated incorrectly. There will be two files to submit, one for each method.
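A minimal sketch for writing the two files in the required format (one integer label per line, in the same order as the points in the data set); the all-zero arrays are placeholders for your actual labelings:

```python
import numpy as np

alg1_labels = np.zeros(12000, dtype=int)   # replace with Method 1 output
alg2_labels = np.zeros(12000, dtype=int)   # replace with Method 2 output

for fname, labels in [("alg1_out.txt", alg1_labels),
                      ("alg2_out.txt", alg2_labels)]:
    # Sanity checks matching the stated format: 12,000 lines,
    # non-negative integer labels.
    assert len(labels) == 12000 and labels.min() >= 0
    np.savetxt(fname, labels, fmt="%d")
```

If your method marks noise points (e.g. DBSCAN's -1), remember to relabel them as non-negative integers before writing the files.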
- Go to the Canvas dropbox. Upload the files.
- (T.B. set up) Also enter the names of the methods used.