Statistical Computing
STAT 534 Spring Quarter 2019

Project

[ Generalities ] [ Data sets ] [ What to do ] [ Code ] [ Hints ] [ Time line ] [ Report ] [ Results ] [ Grading ]
Each project will start from the same training set. You will fit a Hidden Markov Model (HMM) to the training data.

You will submit a report about what you did, your code, and results on a test set to be provided later. We will evaluate and compare the results, and unveil the winners and the data source during exam week.

Data sets

The data set for training is available here. It contains n0=1000 sequences of T=40 integers in {0,1,2,3} (one sequence per line, space-separated). These sequences have been sampled independently from an unknown distribution. Later, we will post an unlabeled test set, on which you will test the model you obtained. Find the test set here: test1_534.dat. It contains 50 sequences of length 40.
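If you work in Python/NumPy, a minimal sketch for reading the training file is below; the file name is a placeholder, since the actual name is whatever the posted file is called.

  import numpy as np

  # "train_534.dat" is a placeholder; substitute the name of the posted training file.
  seqs = np.loadtxt("train_534.dat", dtype=int)    # one sequence per row
  assert seqs.shape == (1000, 40)                  # n0=1000 sequences of length T=40
  assert set(np.unique(seqs)) <= {0, 1, 2, 3}      # alphabet check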

What to do

  1. You will use the data made available to train an HMM by the Baum-Welch method. Choose an appropriate number of states N. We suggest N=5 or larger, but it is possible that N=4 or smaller works, although we did not try these values.
  2. A rigorous way to choose N is by cross-validation. In this case you will have to set aside some part of the training set for validation, or use K-fold CV (K=5 is a good choice).
  3. Number the states from 0 to N-1, so that the state with the most uniform emission probabilities is labeled 0. If possible, give the states with the most deterministic emission probabilities the highest numbers. (E.g., if 1 is emitted w.p. 99% in state q, label this state N-1; if 3 is emitted w.p. 80% in state q', label this state N-2.) A relabeling sketch is given after this list.
    Example: in state 0 the emission probabilities are [0.3 0.3 0.2 0.2]; in state 1 they are [0.5 0.2 0.3 0]; in state 2 they are [0.9 0.1 0 0]; etc.
  4. For a training set of your choice, with n samples, you will write down the (log-)likelihood of this set. Then you will use the Baum-Welch algorithm to maximize this log-likelihood. (The likelihood formula and a forward-pass sketch for computing it are given after this list.)
  5. Report the transition matrix A, and the emission probability matrix B.
  6. Use the Viterbi algorithm to calculate the most likely sequence of states for all the data. Report the time it took to run the Viterbi algorithm on all the data. (A Viterbi sketch is given after this list.)
  7. Write code that, given a sequence of outputs, predicts the next output and the next state, and test it on the data. (A prediction sketch is given after this list.)
  8. Write code that calculates the log-likelihood of the test set (by Forward-Backward).
  9. When the test set is available, you will be asked to perform some of these tasks on it:
    • output the log-likelihood of the test set
    • output the Viterbi sequence of states
    • predict the next output for a given sequence.
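For item 3, one concrete (not mandatory) way to order the states is by the entropy of their emission rows: the most uniform row gets label 0 and the most deterministic rows get the highest labels. A minimal sketch, assuming pi is the length-N initial distribution, A the NxN transition matrix, and B the Nx4 emission matrix:

  import numpy as np

  def relabel_states(pi, A, B):
      """Permute state labels so the most uniform emission row becomes state 0
      and the most deterministic (lowest-entropy) rows get the highest labels."""
      ent = -np.sum(B * np.log(np.clip(B, 1e-12, None)), axis=1)  # entropy of each emission row
      order = np.argsort(-ent)               # descending entropy: most uniform first
      return pi[order], A[np.ix_(order, order)], B[order]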
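For item 4, in standard HMM notation (initial distribution \pi, transition matrix A, emission matrix B; your own notation may differ), the log-likelihood of n independent training sequences x^{(1)},\dots,x^{(n)} of length T that Baum-Welch maximizes is

  \log L(\pi, A, B) = \sum_{s=1}^{n} \log \sum_{q_1,\dots,q_T} \pi_{q_1} B_{q_1}(x^{(s)}_1) \prod_{t=2}^{T} A_{q_{t-1} q_t} B_{q_t}(x^{(s)}_t),

where the inner sum over state paths is exactly what the forward pass computes in O(T N^2) time per sequence.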
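For items 4, 7 and 8, here is a minimal NumPy sketch (same notation as above, observations as integer arrays) of the scaled forward pass, which yields the log-likelihood of one sequence, and of next-state / next-output prediction built on top of it. This is an illustration under the stated assumptions, not the required implementation:

  import numpy as np

  def forward_scaled(seq, pi, A, B):
      """Scaled forward pass for one sequence of integer symbols.
      pi: (N,), A: (N, N), B: (N, M).  Returns the scaled forward
      variables alpha (T, N) and log P(seq | pi, A, B)."""
      T, N = len(seq), len(pi)
      alpha = np.zeros((T, N))
      alpha[0] = pi * B[:, seq[0]]
      c = alpha[0].sum()
      alpha[0] /= c
      loglik = np.log(c)
      for t in range(1, T):
          alpha[t] = (alpha[t - 1] @ A) * B[:, seq[t]]
          c = alpha[t].sum()
          alpha[t] /= c
          loglik += np.log(c)
      return alpha, loglik

  def predict_next(seq, pi, A, B):
      """Distributions of the next state and the next output given x_1..x_T."""
      alpha, _ = forward_scaled(seq, pi, A, B)   # alpha[-1] = P(q_T = . | x_1..x_T)
      next_state = alpha[-1] @ A                 # P(q_{T+1} = . | x_1..x_T)
      next_output = next_state @ B               # P(x_{T+1} = . | x_1..x_T)
      return next_state, next_output

The total (training or test) log-likelihood is the sum of forward_scaled(...)[1] over all sequences, and the predicted next state/output are the argmax of the two vectors returned by predict_next.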
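For item 6, a log-domain Viterbi sketch in the same notation; add a small epsilon inside the logs if A or B contain exact zeros:

  import numpy as np

  def viterbi(seq, pi, A, B):
      """Most likely state sequence for one observation sequence (log domain)."""
      T, N = len(seq), len(pi)
      logA, logB = np.log(A), np.log(B)
      delta = np.zeros((T, N))             # best log-score of paths ending in each state
      psi = np.zeros((T, N), dtype=int)    # backpointers
      delta[0] = np.log(pi) + logB[:, seq[0]]
      for t in range(1, T):
          scores = delta[t - 1][:, None] + logA     # scores[i, j]: best path into i, then i -> j
          psi[t] = scores.argmax(axis=0)
          delta[t] = scores.max(axis=0) + logB[:, seq[t]]
      path = np.empty(T, dtype=int)
      path[-1] = delta[-1].argmax()
      for t in range(T - 2, -1, -1):
          path[t] = psi[t + 1, path[t + 1]]
      return path

Timing a loop over all n0=1000 sequences (e.g. with time.perf_counter) gives the run time asked for in item 6.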

Software resources

For full credit, you should implement all the algorithms yourself. Alternatively, you may use HMM software (in Python) from standard packages (but not software written by your peers or other students at UW). In this case, you must know exactly what the software is doing in the context of your project; more precisely, you must report which software you used and explain how you set its parameters. Do not submit the downloaded software with your code.
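If you do take the package route, one common choice is hmmlearn, whose discrete-emission class is CategoricalHMM in recent releases (in releases before 0.3 the equivalent class was called MultinomialHMM). The sketch below is only an illustration, under those assumptions, of how the parameters might be set and how a held-out split can be used to compare values of N; the file name and the 800/200 split are placeholders, and you should report whatever settings you actually use.

  import numpy as np
  from hmmlearn.hmm import CategoricalHMM    # hmmlearn >= 0.3

  seqs = np.loadtxt("train_534.dat", dtype=int)     # placeholder file name
  train, val = seqs[:800], seqs[800:]               # simple hold-out split

  def as_hmmlearn(s):
      """hmmlearn wants one long column of symbols plus the per-sequence lengths."""
      return s.reshape(-1, 1), [s.shape[1]] * s.shape[0]

  Xtr, ltr = as_hmmlearn(train)
  Xva, lva = as_hmmlearn(val)
  for N in range(3, 8):
      model = CategoricalHMM(n_components=N, n_iter=200, tol=1e-4, random_state=0)
      model.fit(Xtr, ltr)                           # Baum-Welch (EM) training
      print(N, model.score(Xva, lva))               # held-out log-likelihood
  # model.startprob_, model.transmat_, model.emissionprob_ hold pi, A, B.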

Good practices

Time line
  • Training data available May 25
  • Test set available June 10
  • Test results due 24 hours after the test set is posted
  • Report due June 12, midnight
  • Award ceremony (virtual): June 14 or later
Report outline
  • Training: how much data you used, total run time of Baum-Welch, run time per iteration, and how you chose the stopping criterion.
  • How did you choose N? What range of values did you try? If you used cross-validation, how large was the validation set? If you used other methods or additional validation, what were they?
  • What software did you use? How did you set the parameters?
  • Experimental results, e.g. learning curve(s), training (validation) losses, parameters A, B. Be selective in what you show! Full credit will be given for legible, well-labeled, and clearly explained plots and results.
  • The formula(s) you used to predict the next output.
  • At least one paragraph of remarks and comments about the results and data.
  • Optional: references
  • Total length: no more than 5 pages of content, with extra pages containing references or figures, up to no more than 10 pages total.
How to submit your test set results
  • Instructions to be posted.
    We will use a script to download and evaluate your results, so please do not deviate from the given formats or file names; otherwise your results may not be evaluated correctly.
  • Go to the Canvas dropbox. Upload the files.
Grading
Report: 55 points (20 for presentation + 35 for analysis); HMM code implementation: 25 points; test set results: 20 points. Total: 100 points.

 


Marina Meila
Last modified: Sat Nov 30 15:43:52 PST 2013